vhost-net: guest to host kernel escape during migration
A buffer overflow vulnerability was found in the networking virtualization functionality (vhost-net) that could be abused during live migration of virtual machines. A privileged guest user could pass descriptors with an invalid length to the host while live migration was underway to crash the host kernel or, potentially, escalate their privileges on the host.
The warning in mem_cgroup_reparent_charges() was triggered too early and too often in certain cases.
kvm: potential system hang due to an error in mmu_shrink_scan().
nfs: NULL pointer dereference due to an anomalous NFS message sequence.
An attacker who is able to mount an exported NFS filesystem can trigger a NULL pointer dereference by using an invalid NFS sequence. This can panic the machine and deny access to the NFS server. Any outstanding disk writes to the NFS server will be lost.
fuse_kio_pcs: kernel crash in pcs_sockio_xmit().
Processes could get stuck in copy_net_ns() forever.
vziolimit: kernel crash due to a division by zero in throttle_charge().
mem_cgroup_reparent_charges() could get stuck while holding cgroup_mutex and make the whole system hang.
kvm: inefficient memory shrinking for VMs.
It was discovered that a node with dozens of CPU cores, lots of RAM and many running VMs could get into a situation where almost all CPU cores were busy in mmu_shrink_scan(). This could happen because memory shrinking was done under the kvm_lock spinlock and only for one VM at a time. In such cases, all CPU cores but one just waited for kvm_lock, while the remaining one was busy with the actual memory shrinking for a VM.
fuse_kio_pcs: latency was calculated incorrectly.
It was found that the in-kernel implementation of Virtuozzo Storage client stored latency values in milliseconds rather than in microseconds, resulting in bogus statistics data.
tcp: integer overflow while processing SACK blocks allows remote denial of service.
An integer overflow was found in the way the Linux kernel's networking subsystem processed TCP Selective Acknowledgment (SACK) segments. While SACK segments are processed, the Linux kernel's socket buffer (SKB) data structure becomes fragmented. Each fragment is about TCP maximum segment size (MSS) bytes. To efficiently process SACK blocks, the Linux kernel merges multiple fragmented SKBs into one, potentially overflowing the variable holding the number of segments. A remote attacker could use this flaw to crash the Linux kernel by sending a crafted sequence of SACK segments on a TCP connection with a small value of TCP MSS, resulting in a denial of service.
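The wraparound described above can be illustrated with a small arithmetic sketch. It is not kernel code: the 16-bit counter width, the function names and the payload sizes are assumptions chosen for illustration only.

```python
# Illustrative model only. The counter width is an assumption; the text
# above does not state how wide the segment counter is.
COUNTER_BITS = 16
COUNTER_MAX = (1 << COUNTER_BITS) - 1


def segments_needed(payload_bytes, mss_payload):
    """Number of MSS-sized segments needed to carry payload_bytes."""
    return -(-payload_bytes // mss_payload)  # ceiling division


def stored_counter(segments):
    """Value left in the counter after truncation to COUNTER_BITS bits."""
    return segments & COUNTER_MAX


def merge(payloads, mss_payload):
    """Merge several buffers; return (true segment count, stored count)."""
    true_total = sum(segments_needed(p, mss_payload) for p in payloads)
    return true_total, stored_counter(true_total)
```

With a typical payload size per segment, the stored value matches the true one: `merge([262144, 262144], 1460)` gives `(360, 360)`. With a hypothetical 8 bytes of payload per segment, the same two buffers need 65536 segments in total, and `merge([262144, 262144], 8)` gives `(65536, 0)`: the stored counter has wrapped around.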
OOM killer would kill tasks from cgroups without memory guarantees first.
If the amount of free memory is low, OOM killer would kill the tasks from cgroups without memory guarantees first. However, it seems more reasonable to kill the tasks from cgroups exceeding their guarantees the most.
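The difference between the two selection policies can be sketched as follows. This is a simplified model, not the kernel's OOM logic: the `Cgroup` type, the function names and the use of the absolute excess over the guarantee (rather than, say, a ratio) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Cgroup:
    name: str
    usage: int      # current memory usage, in bytes
    guarantee: int  # memory guarantee, in bytes (0 means no guarantee)


def excess(cg):
    """How far the cgroup is above its guarantee (assumed metric)."""
    return max(0, cg.usage - cg.guarantee)


def pick_no_guarantee_first(cgroups):
    """Old behaviour as described: cgroups without guarantees go first."""
    unguaranteed = [cg for cg in cgroups if cg.guarantee == 0]
    return max(unguaranteed or cgroups, key=lambda cg: cg.usage)


def pick_largest_excess(cgroups):
    """Alternative: pick the cgroup exceeding its guarantee the most."""
    return max(cgroups, key=excess)
```

For example, given cgroups `a` (usage 300, no guarantee), `b` (usage 2000, guarantee 1000) and `c` (usage 500, guarantee 800), the first policy selects `a`, although `b` is the one that is furthest above its guarantee; the second policy selects `b`.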
virtio_scsi: a race condition in the Linux block layer could cause certain I/O requests to hang.
ploop: kernel crash in ploop_congested().
ext4: inode tables created during online resize were not zeroed.
It was discovered that inode tables created during online resize of an ext4 filesystem were not zeroed after that. This could potentially result in lower performance of the filesystem.
Windows Server 2016 Essentials failed to install into a QEMU VM with disabled PMU.
It was found that if no PMU counters were exposed to the guest, KVM skipped all the remaining PMU-related initialization, including the filling of LBR-related data. As it turned out, Windows Server 2016 Essentials tried to access these data during the installation and failed to install as a result.
ploop: 'pcompact' could hang if run simultaneously with 'ploop-balloon status'
Memory leak in the implementation of IPv4 routing.
It was discovered that a certain sequence of operations related to IPv4 routing could trigger a kernel memory leak. An attacker could potentially exploit that from a container to cause a denial of service.
KVM: potential use-after-free via kvm_ioctl_create_device().
A use-after-free vulnerability was found in the way KVM implements its device control API. When a device is created via kvm_ioctl_create_device(), it holds a reference to a VM object. This reference is transferred to the file descriptor table of the caller. If such a file descriptor was closed, the reference count of the VM object could drop to zero, which could lead to a use-after-free issue. A user/process could use this flaw to crash the guest VM, resulting in a denial of service, or, potentially, gain privileged access to a system.
KVM: use-after-free in the emulation of the preemption timer for the L2 guest systems.
A use-after-free vulnerability was found in the way KVM emulates a preemption timer for L2 guests when nested virtualization is enabled. A guest user/process could use this flaw to crash the host kernel resulting in a denial of service or, potentially, gain privileged access to a system.
System hang (hard lockup) due to an infinite loop in calc_load_ve().
If some process held the CPU cgroup of a container while the container was being stopped, the kernel would try to add this cgroup to the list of such structures again when the container was started the next time. This would corrupt the list, and the calc_load_ve() function would enter an endless loop as a result.
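How adding an already linked entry to a list a second time can lead to an endless traversal is shown in the sketch below. It models a circular doubly linked list in the style commonly used in the kernel; the `Node` class, `list_add()` and `walk()` are simplified stand-ins written for this illustration, not the actual kernel code.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.next = self  # an unlinked node points to itself
        self.prev = self


def list_add(new, head):
    """Insert new right after head, without checking if new is already linked."""
    nxt = head.next
    nxt.prev = new
    new.next = nxt
    new.prev = head
    head.next = new


def walk(head, limit=1000):
    """Return the visited names, or None if the walk never returns to head."""
    names = []
    node = head.next
    while node is not head:
        names.append(node.name)
        if len(names) > limit:
            return None  # a real traversal would spin here forever
        node = node.next
    return names
```

After `list_add(a, head)` and `list_add(b, head)`, `walk(head)` visits `b` and `a` and stops at the head. If `a` is then added again with `list_add(a, head)`, `a.next` points to `b` while `b.next` still points to `a`, so the traversal cycles between the two entries and never reaches the head; `walk(head)` returns `None` only because of its artificial step limit.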
ploop: potential data corruption due to a race between 'prepare_merge' and 'submit_alloc' operations.
High order page allocations were made in neigh_probe() in certain cases.
High order page allocations were triggered by CRIU while restoring TCP sockets.
Network performance issues due to the usage of pfmemalloc reserves.
It was discovered that network drivers could allocate memory for the socket buffers from pfmemalloc memory reserves, even when it was unnecessary. As a result, network packets were dropped by sk_filter_trim_cap(), causing performance issues.
skb drops due to the usage of pfmemalloc reserves were difficult to debug.
Additional diagnostics were introduced to make it easier to detect and analyze skb drops due to the usage of pfmemalloc reserves.
KVM did not update CPUID bits OSXSAVE and OSPKE in some cases.
It was discovered that CPUID bits OSXSAVE and OSPKE were not updated properly by KVM when the guest system rebooted. As a result, the guest system could crash.
The per-container limit on the network interfaces was too low for Docker in some cases.
It was discovered that Docker running inside a Virtuozzo container could hit the limit on the number of network interfaces (256) when it tried to start more than 50 of its containers. This fix allows changing that limit for running containers and increases the default limit to 1024.
txqueuelen could not be changed via SIOCSIFTXQLEN ioctl on the host.
Kernel crash in ext4_clear_inode().
A large tarball with a lot of small files could fail to unpack inside a container if a kmem limit was set.
It was found that unpacking a large tarball with a lot of small files could fail inside a container. This could happen because the kmem limit was hit prematurely, while reclaimable memory was still available.
sr_mod: kernel crash in sr_block_revalidate_disk().
overlayfs: kernel crash in may_open().
CVE-2019-10140. An attacker with local access can create a denial of service situation via a NULL pointer dereference in ovl_posix_acl_create(). The ovl_create() function can return a positive number, leading to a NULL pointer dereference of path in may_open(). This can allow attackers with the ability to create directories on overlayfs to crash the kernel, creating a denial of service (DoS).
Links to certain files in /proc/ inside containers were not validated.
It was discovered that a malicious user inside a Virtuozzo container could potentially overwrite "vzctl" binary on the host. The attacker could replace executables in that container with symlinks to /proc/self/exe. After that, "vzctl exec" called from the host to run one of such executables would try to run the host's "vzctl" there instead. If the attacker managed to intercept that, they would be able to change the contents of the host's "vzctl" binary. The issue is similar to CVE-2019-5736, but affects "vzctl" rather than "runc".
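The attack setup described above relies on executables being replaced with symlinks to /proc/self/exe. The following sketch shows how such links could be recognized from userspace. It is only an illustration with a hypothetical helper name; it is not the validation added to the kernel, and it only handles absolute link targets.

```python
import os


def is_proc_exe_link(path):
    """True if path is a symlink pointing to /proc/<pid or self>/exe.

    Hypothetical helper for illustration; relative targets are not resolved.
    """
    if not os.path.islink(path):
        return False
    target = os.path.normpath(os.readlink(path))
    parts = target.split(os.sep)
    # An absolute "/proc/self/exe" splits into ["", "proc", "self", "exe"].
    return (
        len(parts) == 4
        and parts[0] == ""
        and parts[1] == "proc"
        and parts[3] == "exe"
    )
```

A check of this kind could be run against a binary in the container's filesystem before executing it, for example `is_proc_exe_link("/path/to/binary")`, where the path is a placeholder.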
Kernel crash (BUG_ON) in ploop_relocblks_ioc().
/proc/sys/net/core/somaxconn was not available in the containers.
userfaultfd bypasses tmpfs file permissions.
A flaw was found in the implementation of userfaultfd. An attacker is able to bypass file permissions on filesystems mounted with tmpfs/hugetlbfs to modify a file and possibly disrupt normal system behaviour. The current understanding is that the flaw does not lead to a crash or privilege escalation, but modifications of files on these filesystems in production systems may have adverse effects.
ipvs: an unneeded debug message is output when a network namespace is initialized.
Debug message 'IPVS: Creating netns size=... id=...' could be output many times to the system log when the network namespaces are initialized, making the log less readable.
'perf record -a' caused segfaults in applications executing vsyscalls.
Some operations with ebtables could consume large amounts of memory, resulting in DoS.
A flaw was found in the implementation of ebtables in the Linux kernel. A local attacker in a container could exploit it to consume large amounts of memory, eventually causing denial of service on the host.
Kernel crash (access out of bounds) in SyS_mincore().
vhost: kernel crash (access out of bounds) in memcpy_fromiovecend().
tcache was not shrunk in some situations.
NFS: use-after-free in svc_process_common().
A flaw was found in the implementation of NFS v4.1 in the Linux kernel. NFS v4.1 shares mounted in different network namespaces at the same time can make bc_svc_process() use the wrong back-channel ID and cause a use-after-free. A malicious user in a container can exploit this to cause host kernel memory corruption and a system crash.
Memory corruption due to incorrect socket cloning.
Transforming an IPv6 socket to an IPv4 one, and then transforming it back to a listening socket, could result in kernel memory corruption. An unprivileged user on the host or in a container could exploit this to crash the kernel.
NULL pointer dereference in af_netlink.c: __netlink_ns_capable() allows for denial of service.
The Linux kernel was found to be vulnerable to a NULL pointer dereference bug in the __netlink_ns_capable() function in the net/netlink/af_netlink.c file. A local attacker could exploit this when a net namespace with a netnsid is assigned to cause a kernel panic and a denial of service.
Asynchronous discard requests could fail with EIO because ploop did not properly align them.
Some operations with an NFS server running in a container could crash the host kernel.
It was discovered that a special sequence of operations involving an NFS server in a container with FEATURES="nfsd=on" could crash the host kernel.
Data corruption after online resize of an empty ploop image located on Virtuozzo Storage.
cleancache: missing invalidation of an inode could cause data corruption.
Errors in the implementation of online resize in ext4 caused failures of ploop resize operations.
Potential kernel crash in cbt_flush_cpu_cache().
Ploop: integer overflow in the implementation of direct IO could lead to errors when resizing the ploop image.
Incorrect accounting of network namespaces in the error paths in copy_net_ns().
Use-after-free in the implementation of the shared memory.
A flaw was found in the implementation of shared memory in the Linux kernel. The shm_mmap() function did not always check if the underlying file structures were valid, which could lead to a use-after-free. A local unprivileged user could exploit this to crash the kernel by executing a special sequence of system calls.
Use-after-free due to race condition in AF_PACKET implementation.
It was discovered that a race condition between packet_do_bind() and packet_notifier() in the implementation of AF_PACKET could lead to use-after-free. An unprivileged user on the host or in a container could exploit this to crash the kernel or, potentially, to escalate their privileges in the system.
Potential kernel crash in ext4_close_pfcache().
Integer overflow in create_elf_tables() function.
An integer overflow flaw was found in create_elf_tables(). An unprivileged local user with access to a SUID (or otherwise privileged) binary could use this flaw to escalate their privileges on the system.
Bypass of the size restriction on the arguments and environment variables of a process.
The Linux kernel imposes a size limit on the memory needed to store the arguments and environment variables of a process, 1/4 of the maximum stack size (RLIMIT_STACK). However, the pointers to these data were not taken into account, which allowed attackers to bypass the limit and even exhaust the stack of the process.
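The arithmetic behind the bypass can be sketched as follows. The 8-byte pointer size and the 8 MiB stack limit are assumptions (typical for a 64-bit system with default settings), and the function names are hypothetical.

```python
POINTER_SIZE = 8  # assumption: 64-bit pointers


def strings_limit(stack_limit):
    """Limit on argument/environment string data: 1/4 of RLIMIT_STACK."""
    return stack_limit // 4


def footprint(num_strings, string_len, pointer_size=POINTER_SIZE):
    """Return (bytes of string data incl. NUL terminators, bytes of pointers)."""
    strings = num_strings * (string_len + 1)
    pointers = num_strings * pointer_size
    return strings, pointers
```

With an assumed stack limit of 8 MiB, `strings_limit(8 * 1024 * 1024)` is 2097152 bytes. Passing 2097152 empty strings (one NUL byte each) stays exactly within that limit, yet `footprint(2097152, 0)` returns `(2097152, 16777216)`: the pointers alone need 16 MiB, twice the size of the whole stack.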
Kernel crash in __run_hrtimer().
It was found that the implementation of high resolution timers ('hrtimer' subsystem) did not handle the situation when a timer was started simultaneously with its restart in another thread. As a result, a BUG_ON() could trigger in __run_hrtimer() leading to kernel crash.
Soft lockup in xfrm_policy_flush().
If an error occurred during execution of xfrm_net_init() when a new network namespace was created, xfrm_policy_lock could remain uninitialized. As a result, soft lockup could happen in xfrm_policy_flush() if it tried to acquire the lock after that.
ploop: kernel crash in dio_open().
It was found that the implementation of ploop did not handle errors reported by kthread_create() properly. This could lead to a kernel crash in dio_open().
Containers with NFS mounts failed to migrate: CRIU complained about nfs/clntX files.
It was discovered that a container with NFS mounts could keep the files /var/lib/nfs/rpc_pipefs/nfs/clntX open, even if no NFS server was running there. As a result, CRIU reported errors when the users tried to migrate the container.
File systems: insufficient error handling in sget() could lead to excessive memory consumption.
sunrpc: potential kernel crash (use after free) in svc_process_common().
Potential out-of-bounds read in fuse_dev_splice_write().
Processes could get stuck in an unkillable state when using large FUSE KIO messages.
It was found that the rpc_get_hdr() function from the 'fuse_kio_pcs' module did not return valid values in 'msg_size' in some cases. As a result, the processes using large FUSE KIO messages could get stuck in an unkillable state.