* kvm-arm64/pv-cpuid:
: Paravirtualized implementation ID, courtesy of Shameer Kolothum
:
: Big-little has historically been a pain in the ass to virtualize. The
: implementation ID (MIDR, REVIDR, AIDR) of a vCPU can change at the whim
: of vCPU scheduling. This can be particularly annoying when the guest
: needs to know the underlying implementation to mitigate errata.
:
: "Hyperscalers" face a similar scheduling problem, where VMs may freely
: migrate between hosts in a pool of heterogenous hardware. And yes, our
: server-class friends are equally riddled with errata too.
:
: In absence of an architected solution to this wart on the ecosystem,
: introduce support for paravirtualizing the implementation exposed
: to a VM, allowing the VMM to describe the pool of implementations that a
: VM may be exposed to due to scheduling/migration.
:
: Userspace is expected to intercept and handle these hypercalls using the
: SMCCC filter UAPI, should it choose to do so.
smccc: kvm_guest: Fix kernel builds for 32 bit arm
KVM: selftests: Add test for KVM_REG_ARM_VENDOR_HYP_BMAP_2
smccc/kvm_guest: Enable errata based on implementation CPUs
arm64: Make _midr_in_range_list() an exported function
KVM: arm64: Introduce KVM_REG_ARM_VENDOR_HYP_BMAP_2
KVM: arm64: Specify hypercall ABI for retrieving target implementations
arm64: Modify _midr_range() functions to read MIDR/REVIDR internally
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
* kvm-arm64/nv-vgic:
: NV VGICv3 support, courtesy of Marc Zyngier
:
: Support for emulating the GIC hypervisor controls and managing shadow
: VGICv3 state for the L1 hypervisor. As part of it, bring in support for
: taking IRQs to the L1 and UAPI to manage the VGIC maintenance interrupt.
KVM: arm64: nv: Fail KVM init if asking for NV without GICv3
KVM: arm64: nv: Allow userland to set VGIC maintenance IRQ
KVM: arm64: nv: Fold GICv3 host trapping requirements into guest setup
KVM: arm64: nv: Propagate used_lrs between L1 and L0 contexts
KVM: arm64: nv: Request vPE doorbell upon nested ERET to L2
KVM: arm64: nv: Respect virtual HCR_EL2.TWx setting
KVM: arm64: nv: Add Maintenance Interrupt emulation
KVM: arm64: nv: Handle L2->L1 transition on interrupt injection
KVM: arm64: nv: Nested GICv3 emulation
KVM: arm64: nv: Sanitise ICH_HCR_EL2 accesses
KVM: arm64: nv: Plumb handling of GICv3 EL2 accesses
KVM: arm64: nv: Add ICH_*_EL2 registers to vpcu_sysreg
KVM: arm64: nv: Load timer before the GIC
arm64: sysreg: Add layout for ICH_MISR_EL2
arm64: sysreg: Add layout for ICH_VTR_EL2
arm64: sysreg: Add layout for ICH_HCR_EL2
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
The VGIC maintenance IRQ signals various conditions about the LRs, when
the GIC's virtualization extension is used.
So far we didn't need it, but nested virtualization needs to know about
this interrupt, so add a userland interface to setup the IRQ number.
The architecture mandates that it must be a PPI, on top of that this code
only exports a per-device option, so the PPI is the same on all VCPUs.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
[added some bits of documentation]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250225172930.1850838-16-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Popular HW that is able to use NV also has a broken vgic implementation
that requires trapping.
On such HW, propagate the host trap bits into the guest's shadow
ICH_HCR_EL2 register, making sure we don't allow an L2 guest to bring
the system down.
This involves a bit of tweaking so that the emulation code correctly
poicks up the shadow state as needed, and to only partially sync
ICH_HCR_EL2 back with the guest state to capture EOIcount.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250225172930.1850838-15-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
We have so far made sure that L1 and L0 vgic contexts were
totally independent. There is however one spot of bother with
this approach, and that's in the GICv3 emulation code required by
our fruity friends.
The issue is that the emulation code needs to know how many LRs
are in flight. And while it is easy to reach the L0 version through
the vcpu pointer, doing so for the L1 is much more complicated,
as these structures are private to the nested code.
We could simply expose that structure and pick one or the other
depending on the context, but this seems extra complexity for not
much benefit.
Instead, just propagate the number of used LRs from the nested code
into the L0 context, and be done with it. Should this become a burden,
it can be easily rectified.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250225172930.1850838-14-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Running an L2 guest with GICv4 enabled goes absolutely nowhere, and gets
into a vicious cycle of nested ERET followed by nested exception entry
into the L1.
When KVM does a put on a runnable vCPU, it marks the vPE as nonresident
but does not request a doorbell IRQ. Behind the scenes in the ITS
driver's view of the vCPU, its_vpe::pending_last gets set to true to
indicate that context is still runnable.
This comes to a head when doing the nested ERET into L2. The vPE doesn't
get scheduled on the redistributor as it is exclusively part of the L1's
VGIC context. kvm_vgic_vcpu_pending_irq() returns true because the vPE
appears runnable, and KVM does a nested exception entry into the L1
before L2 ever gets off the ground.
This issue can be papered over by requesting a doorbell IRQ when
descheduling a vPE as part of a nested ERET. KVM needs this anyway to
kick the vCPU out of the L2 when an IRQ becomes pending for the L1.
Link: https://lore.kernel.org/r/20240823212703.3576061-4-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250225172930.1850838-13-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Emulating the vGIC means emulating the dreaded Maintenance Interrupt.
This is a two-pronged problem:
- while running L2, getting an MI translates into an MI injected
in the L1 based on the state of the HW.
- while running L1, we must accurately reflect the state of the
MI line, based on the in-memory state.
The MI INTID is added to the distributor, as expected on any
virtualisation-capable implementation, and further patches
will allow its configuration.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250225172930.1850838-11-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
An interrupt being delivered to L1 while running L2 must result
in the correct exception being delivered to L1.
This means that if, on entry to L2, we found ourselves with pending
interrupts in the L1 distributor, we need to take immediate action.
This is done by posting a request which will prevent the entry in
L2, and deliver an IRQ exception to L1, forcing the switch.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250225172930.1850838-10-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
When entering a nested VM, we set up the hypervisor control interface
based on what the guest hypervisor has set. Especially, we investigate
each list register written by the guest hypervisor whether HW bit is
set. If so, we translate hw irq number from the guest's point of view
to the real hardware irq number if there is a mapping.
Co-developed-by: Jintack Lim <jintack@cs.columbia.edu>
Signed-off-by: Jintack Lim <jintack@cs.columbia.edu>
[Christoffer: Redesigned execution flow around vcpu load/put]
Co-developed-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
[maz: Rewritten to support GICv3 instead of GICv2, NV2 support]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250225172930.1850838-9-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Wire the handling of all GICv3 EL2 registers, and provide emulation
for all the non memory-backed registers (ICC_SRE_EL2, ICH_VTR_EL2,
ICH_MISR_EL2, ICH_ELRSR_EL2, and ICH_EISR_EL2).
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250225172930.1850838-7-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
The ICH_VTR_EL2-related macros are missing a number of config
bits that we are about to handle. Take this opportunity to fully
describe the layout of that register as part of the automatic
generation infrastructure.
This results in a bit of churn to repaint constants that are now
generated with a different format.
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250225172930.1850838-3-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
The ICH_HCR_EL2-related macros are missing a number of control
bits that we are about to handle. Take this opportunity to fully
describe the layout of that register as part of the automatic
generation infrastructure.
This results in a bit of churn, unfortunately.
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250225172930.1850838-2-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
These changes lay the groundwork for adding support for guest kernels,
allowing them to leverage target CPU implementations provided by the
VMM.
No functional changes intended.
Suggested-by: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Sebastian Ott <sebott@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20250221140229.12588-2-shameerali.kolothum.thodi@huawei.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Continuing with the theme of broken VMMs and guests, irqbypass
registration can fail if the virtual ITS lacks a translation for the
MSI. Either the guest hasn't mapped it or userspace may have forgotten
to restore the ITS.
Exit silently and allow irqbypass configuration to succeed. As a reward
for ingenuity, LPIs are demoted to software injection.
Tested-by: Sudheer Dantuluri <dantuluris@google.com>
Fixes: 196b136498b3 ("KVM: arm/arm64: GICv4: Wire mapping/unmapping of VLPIs in VFIO irq bypass")
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250226183124.82094-4-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
The VMM or guest can easily screw up GICv4 vLPI injection by
misconfiguring the MSI or the virtual ITS. Don't fuss over it; limit the
WARN in vgic_v4_unset_forwarding() to fire in the worrying case where an
unrelated HW IRQ was mapped to a vLPI.
Reported-by: Sudheer Dantuluri <dantuluris@google.com>
Tested-by: Sudheer Dantuluri <dantuluris@google.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250226183124.82094-3-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
If userspace creates vcpus, then a vgic, we end-up in a situation
where irqchip_in_kernel() will return true, but no private interrupt
has been allocated for these vcpus. This situation will continue
until userspace initialises the vgic, at which point we fix the
early vcpus. Should a vcpu run or be initialised in the interval,
bad things may happen.
An obvious solution is to move this fix-up phase to the point where
the vgic is created. This ensures that from that point onwards,
all vcpus have their private interrupts, as new vcpus will directly
allocate them.
With that, we have the invariant that when irqchip_in_kernel() is
true, all vcpus have their private interrupts.
Reported-by: Alexander Potapenko <glider@google.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250212182558.2865232-3-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
* kvm-arm64/misc-6.14:
: .
: Misc KVM/arm64 changes for 6.14
:
: - Don't expose AArch32 EL0 capability when NV is enabled
:
: - Update documentation to reflect the full gamut of kvm-arm.mode
: behaviours
:
: - Use the hypervisor VA bit width when dumping stacktraces
:
: - Decouple the hypervisor stack size from PAGE_SIZE, at least
: on the surface...
:
: - Make use of str_enabled_disabled() when advertising GICv4.1 support
:
: - Explicitly handle BRBE traps as UNDEFINED
: .
KVM: arm64: Explicitly handle BRBE traps as UNDEFINED
KVM: arm64: vgic: Use str_enabled_disabled() in vgic_v3_probe()
arm64: kvm: Introduce nvhe stack size constants
KVM: arm64: Fix nVHE stacktrace VA bits mask
Documentation: Update the behaviour of "kvm-arm.mode"
KVM: arm64: nv: Advertise the lack of AArch32 EL0 support
Signed-off-by: Marc Zyngier <maz@kernel.org>
Rather than look-up the hyp vCPU on every run hypercall at EL2,
introduce a per-CPU 'loaded_hyp_vcpu' tracking variable which is updated
by a pair of load/put hypercalls called directly from
kvm_arch_vcpu_{load,put}() when pKVM is enabled.
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-10-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
The return value of xa_store() needs to be checked. This fix adds an
error handling path that resolves the kref inconsistency on failure. As
suggested by Oliver Upton, this function does not return the error code
intentionally because the translation cache is best effort.
Fixes: 8201d1028caa ("KVM: arm64: vgic-its: Maintain a translation cache per ITS")
Signed-off-by: Keisuke Nishimura <keisuke.nishimura@inria.fr>
Suggested-by: Oliver Upton <oliver.upton@linux.dev>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241130144952.23729-1-keisuke.nishimura@inria.fr
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
The ITS ABI infrastructure allows for some pretty lax code, where
the size of the data doesn't have to match the size of the entry,
potentially leading to a collection of interesting bugs.
Commit 7fe28d7e68f9 ("KVM: arm64: vgic-its: Add a data length check
in vgic_its_save_*") added some checks, but starts by implicitly
casting all writes to a 64bit value, hiding some of the issues.
Instead, introduce macros that will check the data type actually used
for dealing with the table entries. The macros are taking a symbolic
entry type that is used to fetch the size of the entry type for the
current ABI. This immediately catches a couple of low-impact gotchas
(zero values that are implicitly 32bit), easy enough to fix.
Given that we currently only have a single ABI, hardcode a couple of
BUILD_BUG_ON()s that will fire if we use anything but a 64bit quantity,
and some (currently unreachable) fallback code that may become useful
one day.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241117165757.247686-5-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
VGIC_MAX_PRIVATE is a pretty useless definition, and is better
replaced with VGIC_NR_PRIVATE_IRQS.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241117165757.247686-4-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
vgic_get_irq() has an awkward signature, as it takes both a kvm
*and* a vcpu, where the vcpu is allowed to be NULL if the INTID
being looked up is a global interrupt (SPI or LPI).
This leads to potentially problematic situations where the INTID
passed is a private interrupt, but that there is no vcpu.
In order to make things less ambiguous, let have *two* helpers
instead:
- vgic_get_irq(struct kvm *kvm, u32 intid), which is only concerned
with *global* interrupts, as indicated by the lack of vcpu.
- vgic_get_vcpu_irq(struct kvm_vcpu *vcpu, u32 intid), which can
return *any* interrupt class, but must have of course a non-NULL
vcpu.
Most of the code nicely falls under one or the other situations,
except for a couple of cases (close to the UABI or in the debug code)
where we have to distinguish between the two cases.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241117165757.247686-3-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Make sure we filter out non-LPI invalidation when handling writes
to GICR_INVLPIR.
Fixes: 4645d11f4a553 ("KVM: arm64: vgic-v3: Implement MMIO-based LPI invalidation")
Reported-by: Alexander Potapenko <glider@google.com>
Tested-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20241117165757.247686-2-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
When DISCARD frees an ITE, it does not invalidate the
corresponding ITE. In the scenario of continuous saves and
restores, there may be a situation where an ITE is not saved
but is restored. This is unreasonable and may cause restore
to fail. This patch clears the corresponding ITE when DISCARD
frees an ITE.
Cc: stable@vger.kernel.org
Fixes: eff484e0298d ("KVM: arm64: vgic-its: ITT save and restore")
Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
[Jing: Update with entry write helper]
Signed-off-by: Jing Zhang <jingzhangos@google.com>
Link: https://lore.kernel.org/r/20241107214137.428439-6-jingzhangos@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
vgic_its_save_device_tables will traverse its->device_list to
save DTE for each device. vgic_its_restore_device_tables will
traverse each entry of device table and check if it is valid.
Restore if valid.
But when MAPD unmaps a device, it does not invalidate the
corresponding DTE. In the scenario of continuous saves
and restores, there may be a situation where a device's DTE
is not saved but is restored. This is unreasonable and may
cause restore to fail. This patch clears the corresponding
DTE when MAPD unmaps a device.
Cc: stable@vger.kernel.org
Fixes: 57a9a117154c ("KVM: arm64: vgic-its: Device table save/restore")
Co-developed-by: Shusen Li <lishusen2@huawei.com>
Signed-off-by: Shusen Li <lishusen2@huawei.com>
Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
[Jing: Update with entry write helper]
Signed-off-by: Jing Zhang <jingzhangos@google.com>
Link: https://lore.kernel.org/r/20241107214137.428439-5-jingzhangos@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
In all the vgic_its_save_*() functinos, they do not check whether
the data length is 8 bytes before calling vgic_write_guest_lock.
This patch adds the check. To prevent the kernel from being blown up
when the fault occurs, KVM_BUG_ON() is used. And the other BUG_ON()s
are replaced together.
Cc: stable@vger.kernel.org
Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
[Jing: Update with the new entry read/write helpers]
Signed-off-by: Jing Zhang <jingzhangos@google.com>
Link: https://lore.kernel.org/r/20241107214137.428439-4-jingzhangos@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
kvm_vgic_map_resources() prematurely marks the distributor as 'ready',
potentially allowing vCPUs to enter the guest before the distributor's
MMIO registration has been made visible.
Plug the race by marking the distributor as ready only after MMIO
registration is completed. Rely on the implied ordering of
synchronize_srcu() to ensure the MMIO registration is visible before
vgic_dist::ready. This also means that writers to vgic_dist::ready are
now serialized by the slots_lock, which was effectively the case already
as all writers held the slots_lock in addition to the config_lock.
Fixes: 59112e9c390b ("KVM: arm64: vgic: Fix a circular locking issue")
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241017001947.2707312-3-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
KVM commits to a particular sizing of SPIs when the vgic is initialized,
which is before the point a vgic becomes ready. On top of that, KVM
supplies a default amount of SPIs should userspace not explicitly
configure this.
As such, the check for vgic_ready() in the handling of
KVM_DEV_ARM_VGIC_GRP_NR_IRQS is completely wrong, and testing if nr_spis
is nonzero is sufficient for preventing userspace from playing games
with us.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241017001947.2707312-2-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
As there is very little ordering in the KVM API, userspace can
instanciate a half-baked GIC (missing its memory map, for example)
at almost any time.
This means that, with the right timing, a thread running vcpu-0
can enter the kernel without a GIC configured and get a GIC created
behind its back by another thread. Amusingly, it will pick up
that GIC and start messing with the data structures without the
GIC having been fully initialised.
Similarly, a thread running vcpu-1 can enter the kernel, and try
to init the GIC that was previously created. Since this GIC isn't
properly configured (no memory map), it fails to correctly initialise.
And that's the point where we decide to teardown the GIC, freeing all
its resources. Behind vcpu-0's back. Things stop pretty abruptly,
with a variety of symptoms. Clearly, this isn't good, we should be
a bit more careful about this.
It is obvious that this guest is not viable, as it is missing some
important part of its configuration. So instead of trying to tear
bits of it down, let's just mark it as *dead*. It means that any
further interaction from userspace will result in -EIO. The memory
will be released on the "normal" path, when userspace gives up.
Cc: stable@vger.kernel.org
Reported-by: Alexander Potapenko <glider@google.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241009183603.3221824-1-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Alex reports that syzkaller has managed to trigger a use-after-free when
tearing down a VM:
BUG: KASAN: slab-use-after-free in kvm_put_kvm+0x300/0xe68 virt/kvm/kvm_main.c:5769
Read of size 8 at addr ffffff801c6890d0 by task syz.3.2219/10758
CPU: 3 UID: 0 PID: 10758 Comm: syz.3.2219 Not tainted 6.11.0-rc6-dirty #64
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace+0x17c/0x1a8 arch/arm64/kernel/stacktrace.c:317
show_stack+0x2c/0x3c arch/arm64/kernel/stacktrace.c:324
__dump_stack lib/dump_stack.c:93 [inline]
dump_stack_lvl+0x94/0xc0 lib/dump_stack.c:119
print_report+0x144/0x7a4 mm/kasan/report.c:377
kasan_report+0xcc/0x128 mm/kasan/report.c:601
__asan_report_load8_noabort+0x20/0x2c mm/kasan/report_generic.c:381
kvm_put_kvm+0x300/0xe68 virt/kvm/kvm_main.c:5769
kvm_vm_release+0x4c/0x60 virt/kvm/kvm_main.c:1409
__fput+0x198/0x71c fs/file_table.c:422
____fput+0x20/0x30 fs/file_table.c:450
task_work_run+0x1cc/0x23c kernel/task_work.c:228
do_notify_resume+0x144/0x1a0 include/linux/resume_user_mode.h:50
el0_svc+0x64/0x68 arch/arm64/kernel/entry-common.c:169
el0t_64_sync_handler+0x90/0xfc arch/arm64/kernel/entry-common.c:730
el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:598
Upon closer inspection, it appears that we do not properly tear down the
MMIO registration for a vCPU that fails creation late in the game, e.g.
a vCPU w/ the same ID already exists in the VM.
It is important to consider the context of commit that introduced this bug
by moving the unregistration out of __kvm_vgic_vcpu_destroy(). That
change correctly sought to avoid an srcu v. config_lock inversion by
breaking up the vCPU teardown into two parts, one guarded by the
config_lock.
Fix the use-after-free while avoiding lock inversion by adding a
special-cased unregistration to __kvm_vgic_vcpu_destroy(). This is safe
because failed vCPUs are torn down outside of the config_lock.
Cc: stable@vger.kernel.org
Fixes: f616506754d3 ("KVM: arm64: vgic: Don't hold config_lock while unregistering redistributors")
Reported-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241007223909.2157336-1-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
In order to be consistent, we shouldn't advertise a GICv3 when none
is actually usable by the guest.
Wipe the feature when these conditions apply, and allow the field
to be written from userspace.
This now allows us to rewrite the kvm_has_gicv3 helper() in terms
of kvm_has_feat(), given that it is always evaluated at runtime.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240827152517.3909653-6-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
On a VHE system, no GICv3 traps get configured when no irqchip is
present. This is not quite matching the "no GICv3" semantics that
we want to present.
Force such traps to be configured in this case.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240827152517.3909653-4-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
We so far only write the ICH_HCR_EL2 config in two situations:
- when we need to emulate the GICv3 CPU interface due to HW bugs
- when we do direct injection, as the virtual CPU interface needs
to be enabled
This is all good. But it also means that we don't do anything special
when we emulate a GICv2, or that there is no GIC at all.
What happens in this case when the guest uses the GICv3 system
registers? The *guest* gets a trap for a sysreg access (EC=0x18)
while we'd really like it to get an UNDEF.
Fixing this is a bit involved:
- we need to set all the required trap bits (TC, TALL0, TALL1, TDIR)
- for these traps to take effect, we need to (counter-intuitively)
set ICC_SRE_EL1.SRE to 1 so that the above traps take priority.
Note that doesn't fully work when GICv2 emulation is enabled, as
we cannot set ICC_SRE_EL1.SRE to 1 (it breaks Group0 delivery as
IRQ).
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240827152517.3909653-3-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Follow the pattern introduced with vcpu_set_hcr(), and introduce
vcpu_set_ich_hcr(), which configures the GICv3 traps at the same
point.
This will allow future changes to introduce trap configuration on
a per-VM basis.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240827152517.3909653-2-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
On a system with a GICv3, if a guest hasn't been configured with
GICv3 and that the host is not capable of GICv2 emulation,
a write to any of the ICC_*SGI*_EL1 registers is trapped to EL2.
We therefore try to emulate the SGI access, only to hit a NULL
pointer as no private interrupt is allocated (no GIC, remember?).
The obvious fix is to give the guest what it deserves, in the
shape of a UNDEF exception.
Reported-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240820100349.3544850-2-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
We recently moved the teardown of the vgic part of a vcpu inside
a critical section guarded by the config_lock. This teardown phase
involves calling into kvm_io_bus_unregister_dev(), which takes the
kvm->srcu lock.
However, this violates the established order where kvm->srcu is
taken on a memory fault (such as an MMIO access), possibly
followed by taking the config_lock if the GIC emulation requires
mutual exclusion from the other vcpus.
It therefore results in a bad lockdep splat, as reported by Zenghui.
Fix this by moving the call to kvm_io_bus_unregister_dev() outside
of the config_lock critical section. At this stage, there shouln't
be any need to hold the config_lock.
As an additional bonus, document the ordering between kvm->slots_lock,
kvm->srcu and kvm->arch.config_lock so that I cannot pretend I didn't
know about those anymore.
Fixes: 9eb18136af9f ("KVM: arm64: vgic: Hold config_lock while tearing down a CPU interface")
Reported-by: Zenghui Yu <yuzenghui@huawei.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Zenghui Yu <yuzenghui@huawei.com>
Tested-by: Zenghui Yu <yuzenghui@huawei.com>
Link: https://lore.kernel.org/r/20240819125045.3474845-1-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
If there were LPIs being mapped behind our back (i.e., between .start() and
.stop()), we would put them at iter_unmark_lpis() without checking if they
were actually *marked*, which is obviously not good.
Switch to use the xa_for_each_marked() iterator to fix it.
Cc: stable@vger.kernel.org
Fixes: 85d3ccc8b75b ("KVM: arm64: vgic-debug: Use an xarray mark for debug iterator")
Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240817101541.1664-1-yuzenghui@huawei.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Tearing down a vcpu CPU interface involves freeing the private interrupt
array. If we don't hold the lock, we may race against another thread
trying to configure it. Yeah, fuzzers do wonderful things...
Taking the lock early solves this particular problem.
Fixes: 03b3d00a70b5 ("KVM: arm64: vgic: Allocate private interrupts on demand")
Reported-by: Alexander Potapenko <glider@google.com>
Tested-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240808091546.3262111-1-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
In case the guest doesn't have any LPI, we previously relied on the
iterator setting
'intid = nr_spis + VGIC_NR_PRIVATE_IRQS' && 'lpi_idx = 1'
to exit the iterator. But it was broken with commit 85d3ccc8b75b ("KVM:
arm64: vgic-debug: Use an xarray mark for debug iterator") -- the intid
remains at 'nr_spis + VGIC_NR_PRIVATE_IRQS - 1', and we end up endlessly
printing the last SPI's state.
Consider that it's meaningless to search the LPI xarray and populate
lpi_idx when there is no LPI, let's just skip the process for that case.
The result is that
* If there's no LPI, we focus on the intid and exit the iterator when it
runs out of the valid SPI range.
* Otherwise we keep the current logic and let the xarray drive the
iterator.
Fixes: 85d3ccc8b75b ("KVM: arm64: vgic-debug: Use an xarray mark for debug iterator")
Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240807052024.2084-1-yuzenghui@huawei.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Get rid of unexpected unlock sparse warnings in vgic code
by adding an annotation to vgic_queue_irq_unlock().
arch/arm64/kvm/vgic/vgic.c:334:17: warning: context imbalance in 'vgic_queue_irq_unlock' - unexpected unlock
arch/arm64/kvm/vgic/vgic.c:419:5: warning: context imbalance in 'kvm_vgic_inject_irq' - different lock contexts for basic block
Signed-off-by: Sebastian Ott <sebott@redhat.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240723101204.7356-4-sebott@redhat.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Fix kdoc warnings by adding missing function parameter
descriptions or by conversion to a normal comment.
Signed-off-by: Sebastian Ott <sebott@redhat.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240723101204.7356-3-sebott@redhat.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
When tearing down a redistributor region, make sure we don't have
any dangling pointer to that region stored in a vcpu.
Fixes: e5a35635464b ("kvm: arm64: vgic-v3: Introduce vgic_v3_free_redist_region()")
Reported-by: Alexander Potapenko <glider@google.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240605175637.1635653-1-maz@kernel.org
Cc: stable@vger.kernel.org
- Move a lot of state that was previously stored on a per vcpu
basis into a per-CPU area, because it is only pertinent to the
host while the vcpu is loaded. This results in better state
tracking, and a smaller vcpu structure.
- Add full handling of the ERET/ERETAA/ERETAB instructions in
nested virtualisation. The last two instructions also require
emulating part of the pointer authentication extension.
As a result, the trap handling of pointer authentication has
been greattly simplified.
- Turn the global (and not very scalable) LPI translation cache
into a per-ITS, scalable cache, making non directly injected
LPIs much cheaper to make visible to the vcpu.
- A batch of pKVM patches, mostly fixes and cleanups, as the
upstreaming process seems to be resuming. Fingers crossed!
- Allocate PPIs and SGIs outside of the vcpu structure, allowing
for smaller EL2 mapping and some flexibility in implementing
more or less than 32 private IRQs.
- Purge stale mpidr_data if a vcpu is created after the MPIDR
map has been created.
- Preserve vcpu-specific ID registers across a vcpu reset.
- Various minor cleanups and improvements.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmY/PT4ACgkQI9DQutE9
ekNwSA/7BTro0n5gP5/SfSFJeEedigpmHQJtHJk9og0LBzjXZTvYqKpI5J1HnpWE
AFsDf3aDRPaSCvI+S14LkkK+TmGtVEXUg8YGytQo08IcO2x6xBT/YjpkVOHy23kq
SGgNMPNUH2sycb7hTcz9Z/V0vBeYwFzYEAhmpvtROvmaRd8ZIyt+ofcclwUZZAQ2
SolOXR2d+ynCh8ZCOexqyZ67keikW1NXtW5aNWWFc6S6qhmcWdaWJGDcSyHauFac
+YuHjPETJYh7TNpwYTmKclRh1fk/CgA/e+r71Hlgdkg+DGCyVnEZBQxqMi6GTzNC
dzy3qhTtRT61SR54q55yMVIC3o6uRSkht+xNg1Nd+UghiqGKAtoYhvGjduodONW2
1Eas6O+vHipu98HgFnkJRPlnF1HR3VunPDwpzIWIZjK0fIXEfrWqCR3nHFaxShOR
dniTEPfELguxOtbl3jCZ+KHCIXueysczXFlqQjSDkg/P1l0jKBgpkZzMPY2mpP1y
TgjipfSL5gr1GPdbrmh4WznQtn5IYWduKIrdEmSBuru05OmBaCO4geXPUwL4coHd
O8TBnXYBTN/z3lORZMSOj9uK8hgU1UWmnOIkdJ4YBBAL8DSS+O+KtCRkHQP0ghl+
whl0q1SWTu4LtOQzN5CUrhq9Tge11erEt888VyJbBJmv8x6qJjE=
=CEfD
-----END PGP SIGNATURE-----
Merge tag 'kvmarm-6.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for Linux 6.10
- Move a lot of state that was previously stored on a per vcpu
basis into a per-CPU area, because it is only pertinent to the
host while the vcpu is loaded. This results in better state
tracking, and a smaller vcpu structure.
- Add full handling of the ERET/ERETAA/ERETAB instructions in
nested virtualisation. The last two instructions also require
emulating part of the pointer authentication extension.
As a result, the trap handling of pointer authentication has
been greattly simplified.
- Turn the global (and not very scalable) LPI translation cache
into a per-ITS, scalable cache, making non directly injected
LPIs much cheaper to make visible to the vcpu.
- A batch of pKVM patches, mostly fixes and cleanups, as the
upstreaming process seems to be resuming. Fingers crossed!
- Allocate PPIs and SGIs outside of the vcpu structure, allowing
for smaller EL2 mapping and some flexibility in implementing
more or less than 32 private IRQs.
- Purge stale mpidr_data if a vcpu is created after the MPIDR
map has been created.
- Preserve vcpu-specific ID registers across a vcpu reset.
- Various minor cleanups and improvements.
* kvm-arm64/misc-6.10:
: .
: Misc fixes and updates targeting 6.10
:
: - Improve boot-time diagnostics when the sysreg tables
: are not correctly sorted
:
: - Allow FFA_MSG_SEND_DIRECT_REQ in the FFA proxy
:
: - Fix duplicate XNX field in the ID_AA64MMFR1_EL1
: writeable mask
:
: - Allocate PPIs and SGIs outside of the vcpu structure, allowing
: for smaller EL2 mapping and some flexibility in implementing
: more or less than 32 private IRQs.
:
: - Use bitmap_gather() instead of its open-coded equivalent
:
: - Make protected mode use hVHE if available
:
: - Purge stale mpidr_data if a vcpu is created after the MPIDR
: map has been created
: .
KVM: arm64: Destroy mpidr_data for 'late' vCPU creation
KVM: arm64: Use hVHE in pKVM by default on CPUs with VHE support
KVM: arm64: Fix hvhe/nvhe early alias parsing
KVM: arm64: Convert kvm_mpidr_index() to bitmap_gather()
KVM: arm64: vgic: Allocate private interrupts on demand
KVM: arm64: Remove duplicated AA64MMFR1_EL1 XNX
KVM: arm64: Remove FFA_MSG_SEND_DIRECT_REQ from the denylist
KVM: arm64: Improve out-of-order sysreg table diagnostics
Signed-off-by: Marc Zyngier <maz@kernel.org>
* kvm-arm64/pkvm-6.10: (25 commits)
: .
: At last, a bunch of pKVM patches, courtesy of Fuad Tabba.
: From the cover letter:
:
: "This series is a bit of a bombay-mix of patches we've been
: carrying. There's no one overarching theme, but they do improve
: the code by fixing existing bugs in pKVM, refactoring code to
: make it more readable and easier to re-use for pKVM, or adding
: functionality to the existing pKVM code upstream."
: .
KVM: arm64: Force injection of a data abort on NISV MMIO exit
KVM: arm64: Restrict supported capabilities for protected VMs
KVM: arm64: Refactor setting the return value in kvm_vm_ioctl_enable_cap()
KVM: arm64: Document the KVM/arm64-specific calls in hypercalls.rst
KVM: arm64: Rename firmware pseudo-register documentation file
KVM: arm64: Reformat/beautify PTP hypercall documentation
KVM: arm64: Clarify rationale for ZCR_EL1 value restored on guest exit
KVM: arm64: Introduce and use predicates that check for protected VMs
KVM: arm64: Add is_pkvm_initialized() helper
KVM: arm64: Simplify vgic-v3 hypercalls
KVM: arm64: Move setting the page as dirty out of the critical section
KVM: arm64: Change kvm_handle_mmio_return() return polarity
KVM: arm64: Fix comment for __pkvm_vcpu_init_traps()
KVM: arm64: Prevent kmemleak from accessing .hyp.data
KVM: arm64: Do not map the host fpsimd state to hyp in pKVM
KVM: arm64: Rename __tlb_switch_to_{guest,host}() in VHE
KVM: arm64: Support TLB invalidation in guest context
KVM: arm64: Avoid BBM when changing only s/w bits in Stage-2 PTE
KVM: arm64: Check for PTE validity when checking for executable/cacheable
KVM: arm64: Avoid BUG-ing from the host abort path
...
Signed-off-by: Marc Zyngier <maz@kernel.org>
Private interrupts are currently part of the CPU interface structure
that is part of each and every vcpu we create.
Currently, we have 32 of them per vcpu, resulting in a per-vcpu array
that is just shy of 4kB. On its own, that's no big deal, but it gets
in the way of other things:
- each vcpu gets mapped at EL2 on nVHE/hVHE configurations. This
requires memory that is physically contiguous. However, the EL2
code has no purpose looking at the interrupt structures and
could do without them being mapped.
- supporting features such as EPPIs, which extend the number of
private interrupts past the 32 limit would make the array
even larger, even for VMs that do not use the EPPI feature.
Address these issues by moving the private interrupt array outside
of the vcpu, and replace it with a simple pointer. We take this
opportunity to make it obvious what gets initialised when, as
that path was remarkably opaque, and tighten the locking.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240502154545.3012089-1-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Consolidate the GICv3 VMCR accessor hypercalls into the APR save/restore
hypercalls so that all of the EL2 GICv3 state is covered by a single pair
of hypercalls.
Signed-off-by: Fuad Tabba <tabba@google.com>
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240423150538.2103045-17-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
The last genuine use case for the lpi_list_lock was the global LPI
translation cache, which has been removed in favor of a per-ITS xarray.
Remove a layer from the locking puzzle by getting rid of it.
vgic_add_lpi() still has a critical section that needs to protect
against the insertion of other LPIs; change it to take the LPI xarray's
xa_lock to retain this property.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240422200158.2606761-13-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>