788 Commits

Author SHA1 Message Date
Linus Torvalds
2cd5769fb0 Driver core updates for 6.15-rc1
Here is the big set of driver core updates for 6.15-rc1.  Lots of stuff
 happened this development cycle, including:
   - kernfs scaling changes to make it even faster thanks to rcu
   - bin_attribute constify work in many subsystems
   - faux bus minor tweaks for the rust bindings
   - rust binding updates for driver core, pci, and platform busses,
     making more functionaliy available to rust drivers.  These are all
     due to people actually trying to use the bindings that were in 6.14.
   - make Rafael and Danilo full co-maintainers of the driver core
     codebase
   - other minor fixes and updates.
 
 This has been in linux-next for a while now, with the only reported
 issue being some merge conflicts with the rust tree.  Depending on which
 tree you pull first, you will have conflicts in one of them.  The merge
 resolution has been in linux-next as an example of what to do, or can be
 found here:
 	https://lore.kernel.org/r/CANiq72n3Xe8JcnEjirDhCwQgvWoE65dddWecXnfdnbrmuah-RQ@mail.gmail.com
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZ+mMrg8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ylRgwCdH58OE3BgL0uoFY5vFImStpmPtqUAoL5HpVWI
 jtbJ+UuXGsnmO+JVNBEv
 =gy6W
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updatesk from Greg KH:
 "Here is the big set of driver core updates for 6.15-rc1. Lots of stuff
  happened this development cycle, including:

   - kernfs scaling changes to make it even faster thanks to rcu

   - bin_attribute constify work in many subsystems

   - faux bus minor tweaks for the rust bindings

   - rust binding updates for driver core, pci, and platform busses,
     making more functionaliy available to rust drivers. These are all
     due to people actually trying to use the bindings that were in
     6.14.

   - make Rafael and Danilo full co-maintainers of the driver core
     codebase

   - other minor fixes and updates"

* tag 'driver-core-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (52 commits)
  rust: platform: require Send for Driver trait implementers
  rust: pci: require Send for Driver trait implementers
  rust: platform: impl Send + Sync for platform::Device
  rust: pci: impl Send + Sync for pci::Device
  rust: platform: fix unrestricted &mut platform::Device
  rust: pci: fix unrestricted &mut pci::Device
  rust: device: implement device context marker
  rust: pci: use to_result() in enable_device_mem()
  MAINTAINERS: driver core: mark Rafael and Danilo as co-maintainers
  rust/kernel/faux: mark Registration methods inline
  driver core: faux: only create the device if probe() succeeds
  rust/faux: Add missing parent argument to Registration::new()
  rust/faux: Drop #[repr(transparent)] from faux::Registration
  rust: io: fix devres test with new io accessor functions
  rust: io: rename `io::Io` accessors
  kernfs: Move dput() outside of the RCU section.
  efi: rci2: mark bin_attribute as __ro_after_init
  rapidio: constify 'struct bin_attribute'
  firmware: qemu_fw_cfg: constify 'struct bin_attribute'
  powerpc/perf/hv-24x7: Constify 'struct bin_attribute'
  ...
2025-04-01 11:02:03 -07:00
Linus Torvalds
32b22538be Scheduler updates for v6.15:
[ Merge note, these two commits are identical:
 
    - f3fa0e40df17 ("sched/clock: Don't define sched_clock_irqtime as static key")
    - b9f2b29b9494 ("sched: Don't define sched_clock_irqtime as static key")
 
   The first one is a cherry-picked version of the second, and the first one
   is already upstream. ]
 
 Core & fair scheduler changes:
 
   - Cancel the slice protection of the idle entity (Zihan Zhou)
   - Reduce the default slice to avoid tasks getting an extra tick
     (Zihan Zhou)
   - Force propagating min_slice of cfs_rq when {en,de}queue tasks
     (Tianchen Ding)
   - Refactor can_migrate_task() to elimate looping (I Hsin Cheng)
   - Add unlikey branch hints to several system calls (Colin Ian King)
   - Optimize current_clr_polling() on certain architectures (Yujun Dong)
 
 Deadline scheduler: (Juri Lelli)
 
   - Remove redundant dl_clear_root_domain call
   - Move dl_rebuild_rd_accounting to cpuset.h
 
 Uclamp:
 
   - Use the uclamp_is_used() helper instead of open-coding it (Xuewen Yan)
   - Optimize sched_uclamp_used static key enabling (Xuewen Yan)
 
 Scheduler topology support: (Juri Lelli)
 
   - Ignore special tasks when rebuilding domains
   - Add wrappers for sched_domains_mutex
   - Generalize unique visiting of root domains
   - Rebuild root domain accounting after every update
   - Remove partition_and_rebuild_sched_domains
   - Stop exposing partition_sched_domains_locked
 
 RSEQ: (Michael Jeanson)
 
   - Update kernel fields in lockstep with CONFIG_DEBUG_RSEQ=y
   - Fix segfault on registration when rseq_cs is non-zero
   - selftests: Add rseq syscall errors test
   - selftests: Ensure the rseq ABI TLS is actually 1024 bytes
 
 Membarriers:
 
   - Fix redundant load of membarrier_state (Nysal Jan K.A.)
 
 Scheduler debugging:
 
   - Introduce and use preempt_model_str() (Sebastian Andrzej Siewior)
   - Make CONFIG_SCHED_DEBUG unconditional (Ingo Molnar)
 
 Fixes and cleanups:
 
   - Always save/restore x86 TSC sched_clock() on suspend/resume
    (Guilherme G. Piccoli)
 
   - Misc fixes and cleanups (Thorsten Blum, Juri Lelli,
     Sebastian Andrzej Siewior)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmfejsoRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1ivkhAAwBF2tYRBS1oIHcC/OKK3JJoHVDp2LFbU
 9sm5S3ZlGD/Ns2fbpY+9A8UFgUFfjYiTSV7hvf2B9Vge0XSxTmMNFu/MdxLBbo9r
 w6GSeNcNDQKpjEGLkrmPFsa2fiYI4dmH0IzDbS9V2cNPk470QBKjAKXNPaSER691
 n2wLnQq+m5o4gXnPjnSz6RrrzisRnm2GOWnDV/iqR47pZFNlX2wWlo3s5r7//Hw0
 a+QfEfpgKehhy/VSDXmSAgpqnNffjc78yBV6LNoVUddwahnOWiQMS3XViOqgy5VO
 jUGBrzW+sKkdRMBppxwJ/0XWgHGC27amIgnU0ZE5u+eiUEu8H9qWl1cRCFxyeB0O
 8+WNfwmkH+FPWUdsn84kdePhSsZy6HfM6h44Xe0hx1V7tQXEXfbPzK3TnQg8Ktt1
 Ky6ctbZt4cGpqGQuIqvba21A/racrD/DgvB7mHeZksnqZoKTDwxhT/nlQGpuwPoy
 SJYd1ynFVJvfC69SwMdwnaglimvEZx1GfT0o5XtCMslY5NkWCou5u+e65WX7ccU5
 94wBCwI1/+KiFMJZp6TlPw07Q/Hsj9dDryxLc3OunU3zMVt3++1GZnBS/eb5FN1A
 5TlAEpxgH9c5Q4/XJFCvx21DrTVHSuIrR6naK91bgqHo0cEfaHGtO/ejPtJGSfxe
 YIFnnu1dhRw=
 =SuQE
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
 "Core & fair scheduler changes:

   - Cancel the slice protection of the idle entity (Zihan Zhou)
   - Reduce the default slice to avoid tasks getting an extra tick
     (Zihan Zhou)
   - Force propagating min_slice of cfs_rq when {en,de}queue tasks
     (Tianchen Ding)
   - Refactor can_migrate_task() to elimate looping (I Hsin Cheng)
   - Add unlikey branch hints to several system calls (Colin Ian King)
   - Optimize current_clr_polling() on certain architectures (Yujun
     Dong)

  Deadline scheduler: (Juri Lelli)
   - Remove redundant dl_clear_root_domain call
   - Move dl_rebuild_rd_accounting to cpuset.h

  Uclamp:
   - Use the uclamp_is_used() helper instead of open-coding it (Xuewen
     Yan)
   - Optimize sched_uclamp_used static key enabling (Xuewen Yan)

  Scheduler topology support: (Juri Lelli)
   - Ignore special tasks when rebuilding domains
   - Add wrappers for sched_domains_mutex
   - Generalize unique visiting of root domains
   - Rebuild root domain accounting after every update
   - Remove partition_and_rebuild_sched_domains
   - Stop exposing partition_sched_domains_locked

  RSEQ: (Michael Jeanson)
   - Update kernel fields in lockstep with CONFIG_DEBUG_RSEQ=y
   - Fix segfault on registration when rseq_cs is non-zero
   - selftests: Add rseq syscall errors test
   - selftests: Ensure the rseq ABI TLS is actually 1024 bytes

  Membarriers:
   - Fix redundant load of membarrier_state (Nysal Jan K.A.)

  Scheduler debugging:
   - Introduce and use preempt_model_str() (Sebastian Andrzej Siewior)
   - Make CONFIG_SCHED_DEBUG unconditional (Ingo Molnar)

  Fixes and cleanups:
   - Always save/restore x86 TSC sched_clock() on suspend/resume
     (Guilherme G. Piccoli)
   - Misc fixes and cleanups (Thorsten Blum, Juri Lelli, Sebastian
     Andrzej Siewior)"

* tag 'sched-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
  cpuidle, sched: Use smp_mb__after_atomic() in current_clr_polling()
  sched/debug: Remove CONFIG_SCHED_DEBUG
  sched/debug: Remove CONFIG_SCHED_DEBUG from self-test config files
  sched/debug, Documentation: Remove (most) CONFIG_SCHED_DEBUG references from documentation
  sched/debug: Make CONFIG_SCHED_DEBUG functionality unconditional
  sched/debug: Make 'const_debug' tunables unconditional __read_mostly
  sched/debug: Change SCHED_WARN_ON() to WARN_ON_ONCE()
  rseq/selftests: Fix namespace collision with rseq UAPI header
  include/{topology,cpuset}: Move dl_rebuild_rd_accounting to cpuset.h
  sched/topology: Stop exposing partition_sched_domains_locked
  cgroup/cpuset: Remove partition_and_rebuild_sched_domains
  sched/topology: Remove redundant dl_clear_root_domain call
  sched/deadline: Rebuild root domain accounting after every update
  sched/deadline: Generalize unique visiting of root domains
  sched/topology: Wrappers for sched_domains_mutex
  sched/deadline: Ignore special tasks when rebuilding domains
  tracing: Use preempt_model_str()
  xtensa: Rely on generic printing of preemption model
  x86: Rely on generic printing of preemption model
  s390: Rely on generic printing of preemption model
  ...
2025-03-24 21:28:12 -07:00
Linus Torvalds
94dc216ad8 cgroup: Changes for v6.15
- Add deprecation info messages to cgroup1-only features.
 
 - rstat updates including a bug fix and breaking up a critical section to
   reduce interrupt latency impact.
 
 - Other misc and doc updates.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZ9xO2g4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGQz4AQDeWKmngRsnddEMkqOV1ArwXSr+8xUQrvCBx0RL
 vcjOQQEAusGCTeGXWJ96kw+N9BXvGwFsfSeoxjOqAnvrBS1EgAc=
 =WvJg
 -----END PGP SIGNATURE-----

Merge tag 'cgroup-for-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup updates from Tejun Heo:

 - Add deprecation info messages to cgroup1-only features

 - rstat updates including a bug fix and breaking up a critical section
   to reduce interrupt latency impact

 - Other misc and doc updates

* tag 'cgroup-for-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: rstat: Cleanup flushing functions and locking
  cgroup/rstat: avoid disabling irqs for O(num_cpu)
  mm: Fix a build breakage in memcontrol-v1.c
  blk-cgroup: Simplify policy files registration
  cgroup: Update file naming comment
  cgroup: Add deprecation message to legacy freezer controller
  mm: Add transformation message for per-memcg swappiness
  RFC cgroup/cpuset-v1: Add deprecation messages to sched_relax_domain_level
  cgroup/cpuset-v1: Add deprecation messages to memory_migrate
  cgroup/cpuset-v1: Add deprecation messages to mem_exclusive and mem_hardwall
  cgroup: Print message when /proc/cgroups is read on v2-only system
  cgroup/blkio: Add deprecation messages to reset_stats
  cgroup/cpuset-v1: Add deprecation messages to memory_spread_page and memory_spread_slab
  cgroup/cpuset-v1: Add deprecation messages to sched_load_balance and memory_pressure_enabled
  cgroup, docs: Be explicit about independence of RT_GROUP_SCHED and non-cpu controllers
  cgroup/rstat: Fix forceidle time in cpu.stat
  cgroup/misc: Remove unused misc_cg_res_total_usage
  cgroup/cpuset: Move procfs cpuset attribute under cgroup-v1.c
  cgroup: update comment about dropping cgroup kn refs
2025-03-24 16:49:40 -07:00
Yosry Ahmed
093c8812de cgroup: rstat: Cleanup flushing functions and locking
Now that the rstat lock is being re-acquired on every CPU iteration in
cgroup_rstat_flush_locked(), having the initially acquire the lock is
unnecessary and unclear.

Inline cgroup_rstat_flush_locked() into cgroup_rstat_flush() and move
the lock/unlock calls to the beginning and ending of the loop body to
make the critical section obvious.

cgroup_rstat_flush_hold/release() do not make much sense with the lock
being dropped and reacquired internally. Since it has no external
callers, remove it and explicitly acquire the lock in
cgroup_base_stat_cputime_show() instead.

This leaves the code with a single flushing function,
cgroup_rstat_flush().

Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-20 06:53:02 -10:00
Eric Dumazet
0efc297a3c cgroup/rstat: avoid disabling irqs for O(num_cpu)
cgroup_rstat_flush_locked() grabs the irq safe cgroup_rstat_lock while
iterating all possible cpus. It only drops the lock if there is
scheduler or spin lock contention. If neither, then interrupts can be
disabled for a long time. On large machines this can disable interrupts
for a long enough time to drop network packets. On 400+ CPU machines
I've seen interrupt disabled for over 40 msec.

Prevent rstat from disabling interrupts while processing all possible
cpus. Instead drop and reacquire cgroup_rstat_lock for each cpu. This
approach was previously discussed in
https://lore.kernel.org/lkml/ZBz%2FV5a7%2F6PZeM7S@slm.duckdns.org/,
though this was in the context of an non-irq rstat spin lock.

Benchmark this change with:
1) a single stat_reader process with 400 threads, each reading a test
   memcg's memory.stat repeatedly for 10 seconds.
2) 400 memory hog processes running in the test memcg and repeatedly
   charging memory until oom killed. Then they repeat charging and oom
   killing.

v6.14-rc6 with CONFIG_IRQSOFF_TRACER with stat_reader and hogs, finds
interrupts are disabled by rstat for 45341 usec:
  #  => started at: _raw_spin_lock_irq
  #  => ended at:   cgroup_rstat_flush
  #
  #
  #                    _------=> CPU#
  #                   / _-----=> irqs-off/BH-disabled
  #                  | / _----=> need-resched
  #                  || / _---=> hardirq/softirq
  #                  ||| / _--=> preempt-depth
  #                  |||| / _-=> migrate-disable
  #                  ||||| /     delay
  #  cmd     pid     |||||| time  |   caller
  #     \   /        ||||||  \    |    /
  stat_rea-96532    52d....    0us*: _raw_spin_lock_irq
  stat_rea-96532    52d.... 45342us : cgroup_rstat_flush
  stat_rea-96532    52d.... 45342us : tracer_hardirqs_on <-cgroup_rstat_flush
  stat_rea-96532    52d.... 45343us : <stack trace>
   => memcg1_stat_format
   => memory_stat_format
   => memory_stat_show
   => seq_read_iter
   => vfs_read
   => ksys_read
   => do_syscall_64
   => entry_SYSCALL_64_after_hwframe

With this patch the CONFIG_IRQSOFF_TRACER doesn't find rstat to be the
longest holder. The longest irqs-off holder has irqs disabled for
4142 usec, a huge reduction from previous 45341 usec rstat finding.

Running stat_reader memory.stat reader for 10 seconds:
- without memory hogs: 9.84M accesses => 12.7M accesses
-    with memory hogs: 9.46M accesses => 11.1M accesses
The throughput of memory.stat access improves.

The mode of memory.stat access latency after grouping by of 2 buckets:
- without memory hogs: 64 usec => 16 usec
-    with memory hogs: 64 usec =>  8 usec
The memory.stat latency improves.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
Tested-by: Greg Thelen <gthelen@google.com>
Acked-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-19 07:25:00 -10:00
Juri Lelli
ce9b3f93d7 cgroup/cpuset: Remove partition_and_rebuild_sched_domains
partition_and_rebuild_sched_domains() and partition_sched_domains() are
now equivalent.

Remove the former as a nice clean up.

Suggested-by: Waiman Long <llong@redhat.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Waiman Long <llong@redhat.com>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Waiman Long <longman@redhat.com>
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/Z9MR4ryNDJZDzsSG@jlelli-thinkpadt14gen4.remote.csb
2025-03-17 11:23:42 +01:00
Juri Lelli
2ff899e351 sched/deadline: Rebuild root domain accounting after every update
Rebuilding of root domains accounting information (total_bw) is
currently broken on some cases, e.g. suspend/resume on aarch64. Problem
is that the way we keep track of domain changes and try to add bandwidth
back is convoluted and fragile.

Fix it by simplify things by making sure bandwidth accounting is cleared
and completely restored after root domains changes (after root domains
are again stable).

To be sure we always call dl_rebuild_rd_accounting while holding
cpuset_mutex we also add cpuset_reset_sched_domains() wrapper.

Fixes: 53916d5fd3c0 ("sched/deadline: Check bandwidth overflow earlier for hotplug")
Reported-by: Jon Hunter <jonathanh@nvidia.com>
Co-developed-by: Waiman Long <llong@redhat.com>
Signed-off-by: Waiman Long <llong@redhat.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/Z9MRfeJKJUOyUSto@jlelli-thinkpadt14gen4.remote.csb
2025-03-17 11:23:42 +01:00
Juri Lelli
56209334dd sched/topology: Wrappers for sched_domains_mutex
Create wrappers for sched_domains_mutex so that it can transparently be
used on both CONFIG_SMP and !CONFIG_SMP, as some function will need to
do.

Fixes: 53916d5fd3c0 ("sched/deadline: Check bandwidth overflow earlier for hotplug")
Reported-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Waiman Long <longman@redhat.com>
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/Z9MP5Oq9RB8jBs3y@jlelli-thinkpadt14gen4.remote.csb
2025-03-17 11:23:41 +01:00
Michal Koutný
4a893bdc18 blk-cgroup: Simplify policy files registration
Use one set of files when there is no difference between default and
legacy files, similar to regular subsys files registration. No
functional change.

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-11 09:22:55 -10:00
Michal Koutný
78f6519ed0 cgroup: Add deprecation message to legacy freezer controller
As explained in the commit 76f969e8948d8 ("cgroup: cgroup v2 freezer"),
the original freezer is imperfect, some users may unwittingly rely on it
when there exists the alternative of v2. Print a message when it happens
and explain that in the docs.

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-11 09:22:54 -10:00
Michal Koutný
103149a063 RFC cgroup/cpuset-v1: Add deprecation messages to sched_relax_domain_level
This is not a properly hierarchical resource, it might be better
implemented based on a sched_attr.

Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-11 09:22:54 -10:00
Michal Koutný
db4dc20c11 cgroup/cpuset-v1: Add deprecation messages to memory_migrate
Memory migration (between cgroups) was given up in v2 due to performance
reasons of its implementation. Migration between NUMA nodes within one
memcg may still make sense to modify affinity at runtime though.

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-11 09:22:54 -10:00
Michal Koutný
3138192792 cgroup/cpuset-v1: Add deprecation messages to mem_exclusive and mem_hardwall
The concept of exclusive memory affinity may require complex approaches
like with cpuset v2 cpu partitions. There is so far no implementation in
cpuset v2.
Specific kernel memory affinity may cause unintended (global)
bottlenecks like kmem limits.

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-11 09:22:54 -10:00
Michal Koutný
a0ab145322 cgroup: Print message when /proc/cgroups is read on v2-only system
As a followup to commits 6c2920926b10e ("cgroup: replace
unified-hierarchy.txt with a proper cgroup v2 documentation") and
ab03125268679 ("cgroup: Show # of subsystem CSSes in cgroup.stat"),
add a runtime message to users who read status of controllers in
/proc/cgroups on v2-only system. The detection is based on a)
no controllers are attached to v1, b) default hierarchy is mounted (the
latter is for setups that never mount v2 but read /proc/cgroups upon
boot when controllers default to v2, so that this code may be backported
to older kernels).

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-11 09:22:54 -10:00
Michal Koutný
012c419f8d cgroup/cpuset-v1: Add deprecation messages to memory_spread_page and memory_spread_slab
There is MPOL_INTERLEAVE for user explicit allocations.
Deprecate spreading of allocations that users carry out unwittingly.
Use straight warning level for slab spreading since such a knob is
unnecessarily intertwined with slab allocator.

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-11 09:22:54 -10:00
Michal Koutný
a0dd846257 cgroup/cpuset-v1: Add deprecation messages to sched_load_balance and memory_pressure_enabled
These two v1 feature have analogues in cgroup v2.

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-11 09:22:54 -10:00
Greg Kroah-Hartman
993a47bd7b Linux 6.14-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmfOKBUeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG1aQH/iC+Oyij4VxAjBek
 BOXIT/p6CwlIXb8ObiWWcRjDPizlcxb3RaV8J2RO+IqaQ2wltxpFANq2G7Re2FPm
 SNcEpIURAOVcxHGedcfFA91srO5F4FzNTO8LVp7MIbcgMYy3pdk+dbZmi6A691R+
 t9pb74m+MAnF1o/MUx7pUlhAT/4ymuuR0F7WCSg4h0Xwe5m0nlJY89kJBC7PCjyd
 n3mdhsz3rDSLmt/z/T7HGD89r8sYSvm9cOKtL3ELgGTrm7boQV8ii9Y9w04DI8PQ
 JmIernugcCxmhH36mVUAHgJf2+/T388xFUh/D5+skeUOUZpaJZG866rnb32WpsHc
 eWLFUeg=
 =Wypt
 -----END PGP SIGNATURE-----

Merge 6.14-rc6 into driver-core-next

We need the driver core fix in here as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-10 17:37:25 +01:00
Abel Wu
c4af66a95a cgroup/rstat: Fix forceidle time in cpu.stat
The commit b824766504e4 ("cgroup/rstat: add force idle show helper")
retrieves forceidle_time outside cgroup_rstat_lock for non-root cgroups
which can be potentially inconsistent with other stats.

Rather than reverting that commit, fix it in a way that retains the
effort of cleaning up the ifdef-messes.

Fixes: b824766504e4 ("cgroup/rstat: add force idle show helper")
Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-27 07:19:17 -10:00
Dave Airlie
395436f3bd An reset signal polarity fix for the jd9365da-h3 panel, a folio handling
fix and config fix in nouveau, a dmem cgroup descendant pool handling
 fix, and a missing header for amdxdna.
 -----BEGIN PGP SIGNATURE-----
 
 iJUEABMJAB0WIQTkHFbLp4ejekA/qfgnX84Zoj2+dgUCZ7bn6AAKCRAnX84Zoj2+
 djJnAX9XDHzW0CCJnF8UopQearYcn2DPKrXvLKWwwpSothyoOFiIHyifP7fHlQBX
 XA5+iIQBf1RZq3uTeqbaq3DeD0Pf9LrRUC41g3H7HO9Lt0/Bp2dnRqJJgZVxIg+h
 57yAKxQ5Cw==
 =PwGs
 -----END PGP SIGNATURE-----

Merge tag 'drm-misc-fixes-2025-02-20' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes

An reset signal polarity fix for the jd9365da-h3 panel, a folio handling
fix and config fix in nouveau, a dmem cgroup descendant pool handling
fix, and a missing header for amdxdna.

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Maxime Ripard <mripard@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250220-glorious-cockle-of-might-5b35f7@houat
2025-02-21 09:16:35 +10:00
Friedrich Vock
8821f36333 cgroup/dmem: Don't open-code css_for_each_descendant_pre
The current implementation has a bug: If the current css doesn't
contain any pool that is a descendant of the "pool" (i.e. when
found_descendant == false), then "pool" will point to some unrelated
pool. If the current css has a child, we'll overwrite parent_pool with
this unrelated pool on the next iteration.

Since we can just check whether a pool refers to the same region to
determine whether or not it's related, all the additional pool tracking
is unnecessary, so just switch to using css_for_each_descendant_pre for
traversal.

Fixes: b168ed458dde ("kernel/cgroup: Add "dmem" memory accounting cgroup")
Signed-off-by: Friedrich Vock <friedrich.vock@gmx.de>
Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250127152754.21325-1-friedrich.vock@gmx.de
Signed-off-by: Maarten Lankhorst <dev@lankhorst.se>
2025-02-19 09:50:37 +01:00
Greg Kroah-Hartman
2ce177e9b3 Merge 6.14-rc3 into driver-core-next
We need the faux_device changes in here for future work.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-17 07:24:33 +01:00
Sebastian Andrzej Siewior
633488947e kernfs: Use RCU to access kernfs_node::parent.
kernfs_rename_lock is used to obtain stable kernfs_node::{name|parent}
pointer. This is a preparation to access kernfs_node::parent under RCU
and ensure that the pointer remains stable under the RCU lifetime
guarantees.

For a complete path, as it is done in kernfs_path_from_node(), the
kernfs_rename_lock is still required in order to obtain a stable parent
relationship while computing the relevant node depth. This must not
change while the nodes are inspected in order to build the path.
If the kernfs user never moves the nodes (changes the parent) then the
kernfs_rename_lock is not required and the RCU guarantees are
sufficient. This "restriction" can be set with
KERNFS_ROOT_INVARIANT_PARENT. Otherwise the lock is required.

Rename kernfs_node::parent to kernfs_node::__parent to denote the RCU
access and use RCU accessor while accessing the node.
Make cgroup use KERNFS_ROOT_INVARIANT_PARENT since the parent here can
not change.

Acked-by: Tejun Heo <tj@kernel.org>
Cc: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/r/20250213145023.2820193-6-bigeasy@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-15 17:46:32 +01:00
Muhammad Adeel
db5fd3cf8b cgroup: Remove steal time from usage_usec
The CPU usage time is the time when user, system or both are using the CPU.
Steal time is the time when CPU is waiting to be run by the Hypervisor. It
should not be added to the CPU usage time, hence removing it from the
usage_usec entry.

Fixes: 936f2a70f2077 ("cgroup: add cpu.stat file to root cgroup")
Acked-by: Axel Busch <axel.busch@ibm.com>
Acked-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Muhammad Adeel <muhammad.adeel@ibm.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-07 11:02:17 -10:00
Shakeel Butt
b69bb476de cgroup: fix race between fork and cgroup.kill
Tejun reported the following race between fork() and cgroup.kill at [1].

Tejun:
  I was looking at cgroup.kill implementation and wondering whether there
  could be a race window. So, __cgroup_kill() does the following:

   k1. Set CGRP_KILL.
   k2. Iterate tasks and deliver SIGKILL.
   k3. Clear CGRP_KILL.

  The copy_process() does the following:

   c1. Copy a bunch of stuff.
   c2. Grab siglock.
   c3. Check fatal_signal_pending().
   c4. Commit to forking.
   c5. Release siglock.
   c6. Call cgroup_post_fork() which puts the task on the css_set and tests
       CGRP_KILL.

  The intention seems to be that either a forking task gets SIGKILL and
  terminates on c3 or it sees CGRP_KILL on c6 and kills the child. However, I
  don't see what guarantees that k3 can't happen before c6. ie. After a
  forking task passes c5, k2 can take place and then before the forking task
  reaches c6, k3 can happen. Then, nobody would send SIGKILL to the child.
  What am I missing?

This is indeed a race. One way to fix this race is by taking
cgroup_threadgroup_rwsem in write mode in __cgroup_kill() as the fork()
side takes cgroup_threadgroup_rwsem in read mode from cgroup_can_fork()
to cgroup_post_fork(). However that would be heavy handed as this adds
one more potential stall scenario for cgroup.kill which is usually
called under extreme situation like memory pressure.

To fix this race, let's maintain a sequence number per cgroup which gets
incremented on __cgroup_kill() call. On the fork() side, the
cgroup_can_fork() will cache the sequence number locally and recheck it
against the cgroup's sequence number at cgroup_post_fork() site. If the
sequence numbers mismatch, it means __cgroup_kill() can been called and
we should send SIGKILL to the newly created task.

Reported-by: Tejun Heo <tj@kernel.org>
Closes: https://lore.kernel.org/all/Z5QHE2Qn-QZ6M-KW@slm.duckdns.org/ [1]
Fixes: 661ee6280931 ("cgroup: introduce cgroup.kill")
Cc: stable@vger.kernel.org # v5.14+
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-02 06:54:51 -10:00
Dr. David Alan Gilbert
ad6c08d8c1 cgroup/misc: Remove unused misc_cg_res_total_usage
misc_cg_res_total_usage() was added in 2021 by
commit a72232eabdfc ("cgroup: Add misc cgroup controller")

but has remained unused.

Remove it.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Acked-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-28 09:00:54 -10:00
Michal Koutný
dae68fba8e cgroup/cpuset: Move procfs cpuset attribute under cgroup-v1.c
The cpuset file is a legacy attribute that is bound primarily to cpuset
v1 hierarchy (equivalent information is available in /proc/$pid/cgroup path
on the unified hierarchy in conjunction with respective
cgroup.controllers showing where cpuset controller is enabled).

Followup to commit b0ced9d378d49 ("cgroup/cpuset: move v1 interfaces to
cpuset-v1.c") and hide CONFIG_PROC_PID_CPUSET under CONFIG_CPUSETS_V1.
Drop an obsolete comment too.

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-24 12:01:41 -10:00
Simona Vetter
07c5b27720 Linux 6.13
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmeNkBEeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGgN4IAIEa4TrwIPP0cI4h
 iLI7rXcQaQplFlEGcmLutzItlf2YnoSoIa7fJoThJwKZIcz5o76sqtbTvQebZGsO
 SRmthEixpFdJK//5fQR1OZaSsMAifH2kQrEjvQqF7OxwvVOOAHZ7bUuyOvrTRFE8
 Su6kUXjmtN4O2oBEVgPKiStjd2sIqT8+Y67WjGwbDY7cU7m0qN4aBegPg5wrHQWm
 edIBWN53dv/5R197qntCxaTGG+OsiFHr6LMfg6tLhq8Pw+hFGAcdgcUZ2YeCbmw8
 noN0ukiaOewRgYmZI8oj8x6+zncNR/SWFNgAMxnLvK8o5oHx0R/0CtgNZSi7ocn3
 WIm9hzg=
 =m02S
 -----END PGP SIGNATURE-----

Merge v6.13 into drm-next

A regression was caused by commit e4b5ccd392b9 ("drm/v3d: Ensure job
pointer is set to NULL after job completion"), but this commit is not
yet in next-fixes, fast-forward it.

Note that this recreates Linus merge in 96c84703f1cf ("Merge tag
'drm-next-2025-01-17' of https://gitlab.freedesktop.org/drm/kernel")
because I didn't want to backmerge a random point in the merge window.

Signed-off-by: Simona Vetter <simona.vetter@ffwll.ch>
2025-01-23 14:42:21 +01:00
Linus Torvalds
96c84703f1 drm next for 6.14-rc1
core:
 - device memory cgroup controller added
 - Remove driver date from drm_driver
 - Add drm_printer based hex dumper
 - drm memory stats docs update
 - scheduler documentation improvements
 
 new driver:
 - amdxdna - Ryzen AI NPU support
 
 connector:
 - add a mutex to protect ELD
 - make connector setup two-step
 
 panels:
 - Introduce backlight quirks infrastructure
 - New panels: KDB KD116N2130B12, Tianma TM070JDHG34-00,
 - Multi-Inno Technology MI1010Z1T-1CP11
 
 bridge:
 - ti-sn65dsi83: Add ti,lvds-vod-swing optional properties
 - Provide default implementation of atomic_check for HDMI bridges
 - it605: HDCP improvements, MCCS Support
 
 xe:
 - make OA buffer size configurable
 - GuC capture fixes
 - add ufence and g2h flushes
 - restore system memory GGTT mappings
 - ioctl fixes
 - SRIOV PF scheduling priority
 - allow fault injection
 - lots of improvements/refactors
 - Enable GuC's WA_DUAL_QUEUE for newer platforms
 - IRQ related fixes and improvements
 
 i915:
 - More accurate engine busyness metrics with GuC submission
 - Ensure partial BO segment offset never exceeds allowed max
 - Flush GuC CT receive tasklet during reset preparation
 - Some DG2 refactor to fix DG2 bugs when operating with certain CPUs
 - Fix DG1 power gate sequence
 - Enabling uncompressed 128b/132b UHBR SST
 - Handle hdmi connector init failures, and no HDMI/DP cases
 - More robust engine resets on Haswell and older
 
 i915/xe display:
 - HDCP fixes for Xe3Lpd
 - New GSC FW ARL-H/ARL-U
 - support 3 VDSC engines 12 slices
 - MBUS joining sanitisation
 - reconcile i915/xe display power mgmt
 - Xe3Lpd fixes
 - UHBR rates for Thunderbolt
 
 amdgpu:
 - DRM panic support
 - track BO memory stats at runtime
 - Fix max surface handling in DC
 - Cleaner shader support for gfx10.3 dGPUs
 - fix drm buddy trim handling
 - SDMA engine reset updates
 - Fix doorbell ttm cleanup
 - RAS updates
 - ISP updates
 - SDMA queue reset support
 - Rework DPM powergating interfaces
 - Documentation updates and cleanups
 - DCN 3.5 updates
 - Use a pm notifier to more gracefully handle VRAM eviction on suspend or hibernate
 - Add debugfs interfaces for forcing scheduling to specific engine instances
 - GG 9.5 updates
 - IH 4.4 updates
 - Make missing optional firmware less noisy
 - PSP 13.x updates
 - SMU 13.x updates
 - VCN 5.x updates
 - JPEG 5.x updates
 - GC 12.x updates
 - DC FAMS updates
 
 amdkfd:
 - GG 9.5 updates
 - Logging improvements
 - Shader debugger fixes
 - Trap handler cleanup
 - Cleanup includes
 - Eviction fence wq fix
 
 msm:
 - MDSS:
 - properly described UBWC registers
 - added SM6150 (aka QCS615) support
 - DPU:
 - added SM6150 (aka QCS615) support
 - enabled wide planes if virtual planes are enabled (by using two SSPPs for a single plane)
 - added CWB hardware blocks support
 - DSI:
 - added SM6150 (aka QCS615) support
 - GPU:
 - Print GMU core fw version
 - GMU bandwidth voting for a740 and a750
 - Expose uche trap base via uapi
 - UAPI error reporting
 
 rcar-du:
 - Add r8a779h0 Support
 
 ivpu:
 - Fix qemu crash when using passthrough
 
 nouveau:
 - expose GSP-RM logging buffers via debugfs
 
 panfrost:
 - Add MT8188 Mali-G57 MC3 support
 
 rockchip:
 - Gamma LUT support
 
 hisilicon:
 - new HIBMC support
 
 virtio-gpu:
 - convert to helpers
 - add prime support for scanout buffers
 
 v3d:
 - Add DRM_IOCTL_V3D_PERFMON_SET_GLOBAL
 
 vc4:
 - Add support for BCM2712
 
 vkms:
 - line-per-line compositing algorithm to improve performance
 
 zynqmp:
 - Add DP audio support
 
 mediatek:
 - dp: Add sdp path reset
 - dp: Support flexible length of DP calibration data
 
 etnaviv:
 - add fdinfo memory support
 - add explicit reset handling
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEEKbZHaGwW9KfbeusDHTzWXnEhr4FAmeJ5qYACgkQDHTzWXnE
 hr4o+w/9EbijDfyf8GCj4Qaxov8nZ3KEMW8LLmrYO3epfLsniX+nv01oNdbRXBjl
 QcsKixAvkyfLl61RuPnwbYiSJfxgwZ5K8rke7cshwlMB7zl7xZ+GZRoAmJlnokS4
 uhmclCriW5nfKRNAGUPcj/ReGZeyHwqvGZn3jyuShkIFpE4rDope4DQsTzm/zs/i
 +cKyRAFm86EIdTACr9DVtb1L5uNZOnHDkufRH5EZr/7CWFco1krLxb/r4cvFaiIO
 GiDaLvXKXKwzQ6NeIWWCEU2zTBz0BluI8ggxp1+WlDiYgLDWtCBpBNPAoNJO/iQS
 J+E8bsk2b/aCLSJQgxcK0y80CXpoJyALaqStdHUqxuWv3/o0g8lFUJlfJVCNPIsg
 o4mBkdbgkzkHCPxUbie7uQIx+2DIsEiwWC/YGBeRx49qEYsLWyFHf6JR8j9aHCQq
 eGanaubzR+W2AC81yktd3rcxpmX5kq8n6ax3ZtS9wnio8iyB5jBDM8QeFSAE/vXV
 B5TT1nneh+HXJ6bTwZBFXkiq2JRxUdbZIS5oQLh0zixVthBMISSsYhJ222nH1bC4
 DWIS2ggqSgqkb0WsE29CJyhJ1fPmS3v7lBXqPvjmN5vMto4gGOJAEgT6CiDpGFIz
 zXzNfrirr1r95iSST4PnYVOOkfK3t9gvbWMXgkr0wygtxyoxHzk=
 =5FIc
 -----END PGP SIGNATURE-----

Merge tag 'drm-next-2025-01-17' of https://gitlab.freedesktop.org/drm/kernel

Pull drm updates from Dave Airlie:
 "There are two external interactions of note, the msm tree pull in some
  opp tree, hopefully the opp tree arrives from the same git tree
  however it normally does.

  There is also a new cgroup controller for device memory, that is used
  by drm, so is merging through my tree. This will hopefully help open
  up gpu cgroup usage a bit more and move us forward.

  There is a new accelerator driver for the AMD XDNA Ryzen AI NPUs.

  Then the usual xe/amdgpu/i915/msm leaders and lots of changes and
  refactors across the board:

  core:
   - device memory cgroup controller added
   - Remove driver date from drm_driver
   - Add drm_printer based hex dumper
   - drm memory stats docs update
   - scheduler documentation improvements

  new driver:
   - amdxdna - Ryzen AI NPU support

  connector:
   - add a mutex to protect ELD
   - make connector setup two-step

  panels:
   - Introduce backlight quirks infrastructure
   - New panels: KDB KD116N2130B12, Tianma TM070JDHG34-00,
   - Multi-Inno Technology MI1010Z1T-1CP11

  bridge:
   - ti-sn65dsi83: Add ti,lvds-vod-swing optional properties
   - Provide default implementation of atomic_check for HDMI bridges
   - it605: HDCP improvements, MCCS Support

  xe:
   - make OA buffer size configurable
   - GuC capture fixes
   - add ufence and g2h flushes
   - restore system memory GGTT mappings
   - ioctl fixes
   - SRIOV PF scheduling priority
   - allow fault injection
   - lots of improvements/refactors
   - Enable GuC's WA_DUAL_QUEUE for newer platforms
   - IRQ related fixes and improvements

  i915:
   - More accurate engine busyness metrics with GuC submission
   - Ensure partial BO segment offset never exceeds allowed max
   - Flush GuC CT receive tasklet during reset preparation
   - Some DG2 refactor to fix DG2 bugs when operating with certain CPUs
   - Fix DG1 power gate sequence
   - Enabling uncompressed 128b/132b UHBR SST
   - Handle hdmi connector init failures, and no HDMI/DP cases
   - More robust engine resets on Haswell and older

  i915/xe display:
   - HDCP fixes for Xe3Lpd
   - New GSC FW ARL-H/ARL-U
   - support 3 VDSC engines 12 slices
   - MBUS joining sanitisation
   - reconcile i915/xe display power mgmt
   - Xe3Lpd fixes
   - UHBR rates for Thunderbolt

  amdgpu:
   - DRM panic support
   - track BO memory stats at runtime
   - Fix max surface handling in DC
   - Cleaner shader support for gfx10.3 dGPUs
   - fix drm buddy trim handling
   - SDMA engine reset updates
   - Fix doorbell ttm cleanup
   - RAS updates
   - ISP updates
   - SDMA queue reset support
   - Rework DPM powergating interfaces
   - Documentation updates and cleanups
   - DCN 3.5 updates
   - Use a pm notifier to more gracefully handle VRAM eviction on
     suspend or hibernate
   - Add debugfs interfaces for forcing scheduling to specific engine
     instances
   - GG 9.5 updates
   - IH 4.4 updates
   - Make missing optional firmware less noisy
   - PSP 13.x updates
   - SMU 13.x updates
   - VCN 5.x updates
   - JPEG 5.x updates
   - GC 12.x updates
   - DC FAMS updates

  amdkfd:
   - GG 9.5 updates
   - Logging improvements
   - Shader debugger fixes
   - Trap handler cleanup
   - Cleanup includes
   - Eviction fence wq fix

  msm:
   - MDSS:
      - properly described UBWC registers
      - added SM6150 (aka QCS615) support
   - DPU:
      - added SM6150 (aka QCS615) support
      - enabled wide planes if virtual planes are enabled (by using two
        SSPPs for a single plane)
      - added CWB hardware blocks support
   - DSI:
      - added SM6150 (aka QCS615) support
   - GPU:
      - Print GMU core fw version
      - GMU bandwidth voting for a740 and a750
      - Expose uche trap base via uapi
      - UAPI error reporting

  rcar-du:
   - Add r8a779h0 Support

  ivpu:
   - Fix qemu crash when using passthrough

  nouveau:
   - expose GSP-RM logging buffers via debugfs

  panfrost:
   - Add MT8188 Mali-G57 MC3 support

  rockchip:
   - Gamma LUT support

  hisilicon:
   - new HIBMC support

  virtio-gpu:
   - convert to helpers
   - add prime support for scanout buffers

  v3d:
   - Add DRM_IOCTL_V3D_PERFMON_SET_GLOBAL

  vc4:
   - Add support for BCM2712

  vkms:
   - line-per-line compositing algorithm to improve performance

  zynqmp:
   - Add DP audio support

  mediatek:
   - dp: Add sdp path reset
   - dp: Support flexible length of DP calibration data

  etnaviv:
   - add fdinfo memory support
   - add explicit reset handling"

* tag 'drm-next-2025-01-17' of https://gitlab.freedesktop.org/drm/kernel: (1070 commits)
  drm/bridge: fix documentation for the hdmi_audio_prepare() callback
  doc/cgroup: Fix title underline length
  drm/doc: Include new drm-compute documentation
  cgroup/dmem: Fix parameters documentation
  cgroup/dmem: Select PAGE_COUNTER
  kernel/cgroup: Remove the unused variable climit
  drm/display: hdmi: Do not read EDID on disconnected connectors
  drm/tests: hdmi: Add connector disablement test
  drm/connector: hdmi: Do atomic check when necessary
  drm/amd/display: 3.2.316
  drm/amd/display: avoid reset DTBCLK at clock init
  drm/amd/display: improve dpia pre-train
  drm/amd/display: Apply DML21 Patches
  drm/amd/display: Use HW lock mgr for PSR1
  drm/amd/display: Revised for Replay Pseudo vblank control
  drm/amd/display: Add a new flag for replay low hz
  drm/amd/display: Remove unused read_ono_state function from Hwss module
  drm/amd/display: Do not elevate mem_type change to full update
  drm/amd/display: Do not wait for PSR disable on vbl enable
  drm/amd/display: Remove unnecessary eDP power down
  ...
2025-01-21 16:09:47 -08:00
Haorui He
4a6780a30e cgroup: update comment about dropping cgroup kn refs
the cgroup is actually freed in css_free_rwork_fn() now
the ref count of the cgroup's kernfs_node is also dropped there
so we need to update the corresponding comment in cgroup_mkdir()

Signed-off-by: Haorui He <mail@hehaorui.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-21 09:24:47 -10:00
Maxime Ripard
feb85972b8
cgroup/dmem: Fix parameters documentation
During the dmem cgroup development, the parameters to the
dmem_cgroup_state_evict_valuable() and dmem_cgroup_try_charge() were
changed, but the documentation wasn't adjusted accordingly.

This results in a documentation build warning. Adjust the documentation
to reflect what the final functions parameters are.

Fixes: b168ed458dde ("kernel/cgroup: Add "dmem" memory accounting cgroup")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Closes: https://lore.kernel.org/r/20250113160334.1f09f881@canb.auug.org.au/
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Simona Vetter <simona.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/20250113092608.1349287-2-mripard@kernel.org
Signed-off-by: Maxime Ripard <mripard@kernel.org>
2025-01-15 09:45:24 +01:00
Jiapeng Chong
8f52fd7a7d
kernel/cgroup: Remove the unused variable climit
Variable climit is not effectively used, so delete it.

kernel/cgroup/dmem.c:302:23: warning: variable ‘climit’ set but not used.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=13512
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20250114062804.5092-1-jiapeng.chong@linux.alibaba.com
Signed-off-by: Maxime Ripard <mripard@kernel.org>
2025-01-15 09:39:26 +01:00
Chen Ridong
3cb97a927f cgroup/cpuset: remove kernfs active break
A warning was found:

WARNING: CPU: 10 PID: 3486953 at fs/kernfs/file.c:828
CPU: 10 PID: 3486953 Comm: rmdir Kdump: loaded Tainted: G
RIP: 0010:kernfs_should_drain_open_files+0x1a1/0x1b0
RSP: 0018:ffff8881107ef9e0 EFLAGS: 00010202
RAX: 0000000080000002 RBX: ffff888154738c00 RCX: dffffc0000000000
RDX: 0000000000000007 RSI: 0000000000000004 RDI: ffff888154738c04
RBP: ffff888154738c04 R08: ffffffffaf27fa15 R09: ffffed102a8e7180
R10: ffff888154738c07 R11: 0000000000000000 R12: ffff888154738c08
R13: ffff888750f8c000 R14: ffff888750f8c0e8 R15: ffff888154738ca0
FS:  00007f84cd0be740(0000) GS:ffff8887ddc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000555f9fbe00c8 CR3: 0000000153eec001 CR4: 0000000000370ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 kernfs_drain+0x15e/0x2f0
 __kernfs_remove+0x165/0x300
 kernfs_remove_by_name_ns+0x7b/0xc0
 cgroup_rm_file+0x154/0x1c0
 cgroup_addrm_files+0x1c2/0x1f0
 css_clear_dir+0x77/0x110
 kill_css+0x4c/0x1b0
 cgroup_destroy_locked+0x194/0x380
 cgroup_rmdir+0x2a/0x140

It can be explained by:
rmdir 				echo 1 > cpuset.cpus
				kernfs_fop_write_iter // active=0
cgroup_rm_file
kernfs_remove_by_name_ns	kernfs_get_active // active=1
__kernfs_remove					  // active=0x80000002
kernfs_drain			cpuset_write_resmask
wait_event
//waiting (active == 0x80000001)
				kernfs_break_active_protection
				// active = 0x80000001
// continue
				kernfs_unbreak_active_protection
				// active = 0x80000002
...
kernfs_should_drain_open_files
// warning occurs
				kernfs_put_active

This warning is caused by 'kernfs_break_active_protection' when it is
writing to cpuset.cpus, and the cgroup is removed concurrently.

The commit 3a5a6d0c2b03 ("cpuset: don't nest cgroup_mutex inside
get_online_cpus()") made cpuset_hotplug_workfn asynchronous, This change
involves calling flush_work(), which can create a multiple processes
circular locking dependency that involve cgroup_mutex, potentially leading
to a deadlock. To avoid deadlock. the commit 76bb5ab8f6e3 ("cpuset: break
kernfs active protection in cpuset_write_resmask()") added
'kernfs_break_active_protection' in the cpuset_write_resmask. This could
lead to this warning.

After the commit 2125c0034c5d ("cgroup/cpuset: Make cpuset hotplug
processing synchronous"), the cpuset_write_resmask no longer needs to
wait the hotplug to finish, which means that concurrent hotplug and cpuset
operations are no longer possible. Therefore, the deadlock doesn't exist
anymore and it does not have to 'break active protection' now. To fix this
warning, just remove kernfs_break_active_protection operation in the
'cpuset_write_resmask'.

Fixes: bdb2fd7fc56e ("kernfs: Skip kernfs_drain_open_files() more aggressively")
Fixes: 76bb5ab8f6e3 ("cpuset: break kernfs active protection in cpuset_write_resmask()")
Reported-by: Ji Fa <jifa@huawei.com>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-08 15:54:39 -10:00
Maarten Lankhorst
b168ed458d
kernel/cgroup: Add "dmem" memory accounting cgroup
This code is based on the RDMA and misc cgroup initially, but now
uses page_counter. It uses the same min/low/max semantics as the memory
cgroup as a result.

There's a small mismatch as TTM uses u64, and page_counter long pages.
In practice it's not a problem. 32-bits systems don't really come with
>=4GB cards and as long as we're consistently wrong with units, it's
fine. The device page size may not be in the same units as kernel page
size, and each region might also have a different page size (VRAM vs GART
for example).

The interface is simple:
- Call dmem_cgroup_register_region()
- Use dmem_cgroup_try_charge to check if you can allocate a chunk of memory,
  use dmem_cgroup__uncharge when freeing it. This may return an error code,
  or -EAGAIN when the cgroup limit is reached. In that case a reference
  to the limiting pool is returned.
- The limiting cs can be used as compare function for
  dmem_cgroup_state_evict_valuable.
- After having evicted enough, drop reference to limiting cs with
  dmem_cgroup_pool_state_put.

This API allows you to limit device resources with cgroups.
You can see the supported cards in /sys/fs/cgroup/dmem.capacity
You need to echo +dmem to cgroup.subtree_control, and then you can
partition device memory.

Co-developed-by: Friedrich Vock <friedrich.vock@gmx.de>
Signed-off-by: Friedrich Vock <friedrich.vock@gmx.de>
Co-developed-by: Maxime Ripard <mripard@kernel.org>
Signed-off-by: Maarten Lankhorst <dev@lankhorst.se>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20241204143112.1250983-1-dev@lankhorst.se
Signed-off-by: Maxime Ripard <mripard@kernel.org>
2025-01-06 17:24:38 +01:00
Waiman Long
9b496a8bbe cgroup/cpuset: Prevent leakage of isolated CPUs into sched domains
Isolated CPUs are not allowed to be used in a non-isolated partition.
The only exception is the top cpuset which is allowed to contain boot
time isolated CPUs.

Commit ccac8e8de99c ("cgroup/cpuset: Fix remote root partition creation
problem") introduces a simplified scheme of including only partition
roots in sched domain generation. However, it does not properly account
for this exception case. This can result in leakage of isolated CPUs
into a sched domain.

Fix it by making sure that isolated CPUs are excluded from the top
cpuset before generating sched domains.

Also update the way the boot time isolated CPUs are handled in
test_cpuset_prs.sh to make sure that those isolated CPUs are really
isolated instead of just skipping them in the tests.

Fixes: ccac8e8de99c ("cgroup/cpuset: Fix remote root partition creation problem")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-12-11 05:45:52 -10:00
Costa Shulyupin
eb1dd15fb2 cgroup/cpuset: Remove stale text
Task's cpuset pointer was removed by
commit 8793d854edbc ("Task Control Groups: make cpusets a client of cgroups")

Paragraph "The task_lock() exception ...." was removed by
commit 2df167a300d7 ("cgroups: update comments in cpuset.c")

Remove stale text:

 We also require taking task_lock() when dereferencing a
 task's cpuset pointer. See "The task_lock() exception", at the end of this
 comment.

 Accessing a task's cpuset should be done in accordance with the
 guidelines for accessing subsystem state in kernel/cgroup.c

and reformat.

Co-developed-by: Michal Koutný <mkoutny@suse.com>
Co-developed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-12-10 20:38:41 -10:00
Linus Torvalds
7586d52765 cgroup: Changes for v6.13
- cpu.stat now also shows niced CPU time.
 
 - Freezer and cpuset optimizations.
 
 - Other misc changes.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZztlgg4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGbohAQDE/enqpAX9vSOpQPne4ZzgcPlGTrCwBcka3Z5z
 4aOF0AD/SmdjcJ/EULisD/2O27ovsGAtqDjngrrZwNUTbCNkTQQ=
 =pKyo
 -----END PGP SIGNATURE-----

Merge tag 'cgroup-for-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup updates from Tejun Heo:

 - cpu.stat now also shows niced CPU time

 - Freezer and cpuset optimizations

 - Other misc changes

* tag 'cgroup-for-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup/cpuset: Disable cpuset_cpumask_can_shrink() test if not load balancing
  cgroup/cpuset: Further optimize code if CONFIG_CPUSETS_V1 not set
  cgroup/cpuset: Enforce at most one rebuild_sched_domains_locked() call per operation
  cgroup/cpuset: Revert "Allow suppression of sched domain rebuild in update_cpumasks_hier()"
  MAINTAINERS: remove Zefan Li
  cgroup/freezer: Add cgroup CGRP_FROZEN flag update helper
  cgroup/freezer: Reduce redundant traversal for cgroup_freeze
  cgroup/bpf: only cgroup v2 can be attached by bpf programs
  Revert "cgroup: Fix memory leak caused by missing cgroup_bpf_offline"
  selftests/cgroup: Fix compile error in test_cpu.c
  cgroup/rstat: Selftests for niced CPU statistics
  cgroup/rstat: Tracking cgroup-level niced CPU time
  cgroup/cpuset: Fix spelling errors in file kernel/cgroup/cpuset.c
2024-11-20 09:54:49 -08:00
Linus Torvalds
0f25f0e4ef the bulk of struct fd memory safety stuff
Making sure that struct fd instances are destroyed in the same
 scope where they'd been created, getting rid of reassignments
 and passing them by reference, converting to CLASS(fd{,_pos,_raw}).
 
 We are getting very close to having the memory safety of that stuff
 trivial to verify.
 
 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZzdikAAKCRBZ7Krx/gZQ
 69nJAQCmbQHK3TGUbQhOw6MJXOK9ezpyEDN3FZb4jsu38vTIdgEA6OxAYDO2m2g9
 CN18glYmD3wRyU6Bwl4vGODouSJvDgA=
 =gVH3
 -----END PGP SIGNATURE-----

Merge tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull 'struct fd' class updates from Al Viro:
 "The bulk of struct fd memory safety stuff

  Making sure that struct fd instances are destroyed in the same scope
  where they'd been created, getting rid of reassignments and passing
  them by reference, converting to CLASS(fd{,_pos,_raw}).

  We are getting very close to having the memory safety of that stuff
  trivial to verify"

* tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (28 commits)
  deal with the last remaing boolean uses of fd_file()
  css_set_fork(): switch to CLASS(fd_raw, ...)
  memcg_write_event_control(): switch to CLASS(fd)
  assorted variants of irqfd setup: convert to CLASS(fd)
  do_pollfd(): convert to CLASS(fd)
  convert do_select()
  convert vfs_dedupe_file_range().
  convert cifs_ioctl_copychunk()
  convert media_request_get_by_fd()
  convert spu_run(2)
  switch spufs_calls_{get,put}() to CLASS() use
  convert cachestat(2)
  convert do_preadv()/do_pwritev()
  fdget(), more trivial conversions
  fdget(), trivial conversions
  privcmd_ioeventfd_assign(): don't open-code eventfd_ctx_fdget()
  o2hb_region_dev_store(): avoid goto around fdget()/fdput()
  introduce "fd_pos" class, convert fdget_pos() users to it.
  fdget_raw() users: switch to CLASS(fd_raw)
  convert vmsplice() to CLASS(fd)
  ...
2024-11-18 12:24:06 -08:00
Waiman Long
fbfbf86685 cgroup/cpuset: Disable cpuset_cpumask_can_shrink() test if not load balancing
With some recent proposed changes [1] in the deadline server code,
it has caused a test failure in test_cpuset_prs.sh when a change
is being made to an isolated partition. This is due to failing
the cpuset_cpumask_can_shrink() check for SCHED_DEADLINE tasks at
validate_change().

This is actually a false positive as the failed test case involves an
isolated partition with load balancing disabled. The deadline check
is not meaningful in this case and the users should know what they
are doing.

Fix this by doing the cpuset_cpumask_can_shrink() check only when loading
balanced is enabled. Also change its arguments to use effective_cpus
for the current cpuset and user_xcpus() as an approiximation for the
target effective_cpus as the real effective_cpus hasn't been fully
computed yet as this early stage.

As the check isn't comprehensive, there may be false positives or
negatives. We may have to revise the code to do a more thorough check
in the future if this becomes a concern.

[1] https://lore.kernel.org/lkml/82be06c1-6d6d-4651-86c9-bcc828cbcb80@redhat.com/T/#t

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-11-14 08:44:03 -10:00
Waiman Long
c4c9cebe2f cgroup/cpuset: Further optimize code if CONFIG_CPUSETS_V1 not set
Currently the cpuset code uses group_subsys_on_dfl() to check if we
are running with cgroup v2. If CONFIG_CPUSETS_V1 isn't set, there is
really no need to do this check and we can optimize out some of the
unneeded v1 specific code paths. Introduce a new cpuset_v2() and use it
to replace the cgroup_subsys_on_dfl() check to further optimize the
code.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-11-12 09:07:38 -10:00
Waiman Long
a040c35128 cgroup/cpuset: Enforce at most one rebuild_sched_domains_locked() call per operation
Since commit ff0ce721ec21 ("cgroup/cpuset: Eliminate unncessary
sched domains rebuilds in hotplug"), there is only one
rebuild_sched_domains_locked() call per hotplug operation. However,
writing to the various cpuset control files may still casue more than
one rebuild_sched_domains_locked() call to happen in some cases.

Juri had found that two rebuild_sched_domains_locked() calls in
update_prstate(), one from update_cpumasks_hier() and another one from
update_partition_sd_lb() could cause cpuset partition to be created
with null total_bw for DL tasks. IOW, DL tasks may not be scheduled
correctly in such a partition.

A sample command sequence that can reproduce null total_bw is as
follows.

  # echo Y >/sys/kernel/debug/sched/verbose
  # echo +cpuset >/sys/fs/cgroup/cgroup.subtree_control
  # mkdir /sys/fs/cgroup/test
  # echo 0-7 > /sys/fs/cgroup/test/cpuset.cpus
  # echo 6-7 > /sys/fs/cgroup/test/cpuset.cpus.exclusive
  # echo root >/sys/fs/cgroup/test/cpuset.cpus.partition

Fix this double rebuild_sched_domains_locked() calls problem
by replacing existing calls with cpuset_force_rebuild() except
the rebuild_sched_domains_cpuslocked() call at the end of
cpuset_handle_hotplug(). Checking of the force_sd_rebuild flag is
now done at the end of cpuset_write_resmask() and update_prstate()
to determine if rebuild_sched_domains_locked() should be called or not.

The cpuset v1 code can still call rebuild_sched_domains_locked()
directly as double rebuild_sched_domains_locked() calls is not possible.

Reported-by: Juri Lelli <juri.lelli@redhat.com>
Closes: https://lore.kernel.org/lkml/ZyuUcJDPBln1BK1Y@jlelli-thinkpadt14gen4.remote.csb/
Signed-off-by: Waiman Long <longman@redhat.com>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-11-12 09:07:09 -10:00
Waiman Long
bcd7012afd cgroup/cpuset: Revert "Allow suppression of sched domain rebuild in update_cpumasks_hier()"
Revert commit 3ae0b773211e ("cgroup/cpuset: Allow suppression of sched
domain rebuild in update_cpumasks_hier()") to allow for an alternative
way to suppress unnecessary rebuild_sched_domains_locked() calls in
update_cpumasks_hier() and elsewhere in a following commit.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-11-12 09:07:01 -10:00
Al Viro
457a654939 css_set_fork(): switch to CLASS(fd_raw, ...)
reference acquired there by fget_raw() is not stashed anywhere -
we could as well borrow instead.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-03 01:28:07 -05:00
Al Viro
048181992c fdget_raw() users: switch to CLASS(fd_raw)
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-03 01:28:06 -05:00
Chen Ridong
16e83007cd cgroup/freezer: Add cgroup CGRP_FROZEN flag update helper
Add help to update cgroup CGRP_FROZEN flag. Both cgroup_propagate_frozen
and cgroup_update_frozen functions update CGRP_FROZEN flag, this makes
code concise.

Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-10-23 09:45:09 -10:00
Chen Ridong
ee1251fc0c cgroup/freezer: Reduce redundant traversal for cgroup_freeze
Whether a cgroup is frozen is determined solely by whether it is set to
to be frozen and whether its parent is frozen. Currently, when is cgroup
is frozen or unfrozen, it iterates through the entire subtree to freeze
or unfreeze its descentdants. However, this is unesessary for a cgroup
that does not change its effective frozen status. This path aims to skip
the subtree if its parent does not have a change in effective freeze.

For an example, subtree like, a-b-c-d-e-f-g, when a is frozen, the
entire tree is frozen. If we freeze b and c again, it is unesessary to
iterate d, e, f and g. So does that If we unfreeze b/c.

Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-10-23 09:45:00 -10:00
Chen Ridong
2190df6c91 cgroup/bpf: only cgroup v2 can be attached by bpf programs
Only cgroup v2 can be attached by bpf programs, so this patch introduces
that cgroup_bpf_inherit and cgroup_bpf_offline can only be called in
cgroup v2, and this can fix the memleak mentioned by commit 04f8ef5643bc
("cgroup: Fix memory leak caused by missing cgroup_bpf_offline"), which
has been reverted.

Fixes: 2b0d3d3e4fcf ("percpu_ref: reduce memory footprint of percpu_ref in fast path")
Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself")
Link: https://lore.kernel.org/cgroups/aka2hk5jsel5zomucpwlxsej6iwnfw4qu5jkrmjhyfhesjlfdw@46zxhg5bdnr7/
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-10-21 10:02:35 -10:00
Chen Ridong
feb301c609 Revert "cgroup: Fix memory leak caused by missing cgroup_bpf_offline"
This reverts commit 04f8ef5643bcd8bcde25dfdebef998aea480b2ba.

Only cgroup v2 can be attached by cgroup by BPF programs. Revert this
commit and cgroup_bpf_inherit and cgroup_bpf_offline won't be called in
cgroup v1. The memory leak issue will be fixed with next patch.

Fixes: 04f8ef5643bc ("cgroup: Fix memory leak caused by missing cgroup_bpf_offline")
Link: https://lore.kernel.org/cgroups/aka2hk5jsel5zomucpwlxsej6iwnfw4qu5jkrmjhyfhesjlfdw@46zxhg5bdnr7/
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-10-21 10:02:26 -10:00
Xiu Jianfeng
3cc4e13bb1 cgroup: Fix potential overflow issue when checking max_depth
cgroup.max.depth is the maximum allowed descent depth below the current
cgroup. If the actual descent depth is equal or larger, an attempt to
create a new child cgroup will fail. However due to the cgroup->max_depth
is of int type and having the default value INT_MAX, the condition
'level > cgroup->max_depth' will never be satisfied, and it will cause
an overflow of the level after it reaches to INT_MAX.

Fix it by starting the level from 0 and using '>=' instead.

It's worth mentioning that this issue is unlikely to occur in reality,
as it's impossible to have a depth of INT_MAX hierarchy, but should be
be avoided logically.

Fixes: 1a926e0bbab8 ("cgroup: implement hierarchy limits")
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-10-14 13:39:25 -10:00
Joshua Hahn
aefa398d93 cgroup/rstat: Tracking cgroup-level niced CPU time
Cgroup-level CPU statistics currently include time spent on
user/system processes, but do not include niced CPU time (despite
already being tracked). This patch exposes niced CPU time to the
userspace, allowing users to get a better understanding of their
hardware limits and can facilitate more informed workload distribution.

A new field 'ntime' is added to struct cgroup_base_stat as opposed to
struct task_cputime to minimize footprint.

Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-10-08 08:50:48 -10:00
everestkc
95a616d89c cgroup/cpuset: Fix spelling errors in file kernel/cgroup/cpuset.c
Corrected the spelling errors repoted by codespell as follows:
	temparary ==> temporary
        Proprogate ==> Propagate
        constrainted ==> constrained

Signed-off-by: Everest K.C. <everestkc@everestkc.com.np>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-30 09:00:23 -10:00