5YVR6PDW34SBYSYB6L6AVZUKMGWEUHGDTLFAIQN6XEKEL65SOYUQC impacted just 1 workload. redhat has a nice article https://access.redhat.com/solutions/29894 explaining the mechanism at work, and a note explaining why it happened occasionally:
impacted just 1 workload. redhat has a [nice article](https://access.redhat.com/solutions/29894) explaining the mechanism at work, and a note explaining why it happened occasionally:
`ebpf` tooling https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md#1-kprobekretprobe-dynamic-tracing-kernel-level, and the message already has a function-ish hint
[`ebpf` tooling](https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md#1-kprobekretprobe-dynamic-tracing-kernel-level), and the message contained a function-ish hint
were still around in `Terminating` state. their overlayfs mounts were busy removing the un-shared files, including the above giant directories & their files. that caused ~3 hours of high write-iops for
were still around in `Terminating` state. linux was busy removing the files in their overlayfs mount, including the above giant directories & their files. that caused ~3 hours of high write-iops to commit
- all of the extra ones are `containerd-shim` processes, with no related running containers
- all of the extra ones are `containerd-shim` processes, with no related running containers

each shim is small, but the overall buildup causes system-wide degradation:

- `atop`, our favorite recording tool, starts failing with `Malloc failed for compression buffer`
- anything that scales per-process or per-cgroup has more useless work to do
after 10s, from :44:42 to :44:52, containerd gave up on removing the task, and the orphan shim stays around from then. each shim is small, but the overall buildup causes system-wide degradation:

- `atop`, our favorite recording tool, starts failing with `Malloc failed for compression buffer`
- anything that scales per-process or per-cgroup has more useless work to do
after 10s, from `ExitedAt` :44:42 to `deadline exceeded` :44:52, containerd gave up on removing the task, and the orphan shim stays around from then.
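to watch the shim buildup described above over time, a small counter over `/proc` is enough. this is a minimal sketch, not the tooling we actually used: `count_by_comm` and `shim_count` are hypothetical helper names, and it assumes the shims show up with a command name starting with `containerd-shim` in `/proc/<pid>/comm`.

```python
import os
from collections import Counter


def count_by_comm(proc_root="/proc"):
    """Count processes per command name by reading <proc_root>/<pid>/comm."""
    counts = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a pid directory (e.g. "self", "meminfo")
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                counts[f.read().strip()] += 1
        except OSError:
            continue  # the process exited while we were scanning
    return counts


def shim_count(proc_root="/proc", prefix="containerd-shim"):
    """Number of processes whose command name starts with `prefix` (assumed shim name)."""
    return sum(n for comm, n in count_by_comm(proc_root).items()
               if comm.startswith(prefix))
```

calling `shim_count()` on a node and comparing it with the number of running containers gives a rough idea of how many shims are orphaned.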
again with ebpf and investigating the flow around the log message, we think that discarding/unmounting each container's overlay filesystem is io-intensive as well as io-sensitive.
again with ebpf in one hand and the flow around this area in another, we think that discarding/unmounting each container's overlay filesystem is io-intensive as well as io-sensitive.
cloudfoundry discussed this general problem at https://www.cloudfoundry.org/blog/an-overlayfs-journey-with-the-garden-team/. coccoc is watching with great interest for newer containerd 1.6 releases with fixes around handling overlay deletion. some recent ones improved short-term, temporary mounts by marking them readonly. containerd maintainers also made a great reproduction with the `strace` fault-injection feature https://github.com/containerd/containerd/pull/9004/files#diff-1d0d1c3863f35bb86ef37975c4e1a2062e6ca71e6f6a94dc385f8a3556284ddcR117
cloudfoundry discussed [this general problem](https://www.cloudfoundry.org/blog/an-overlayfs-journey-with-the-garden-team/). coccoc is watching with great interest for newer containerd 1.6 LTS releases with fixes around handling overlay deletion. some recent ones improved short-term, temporary mounts by marking them readonly. containerd maintainers also made a [great reproduction](https://github.com/containerd/containerd/pull/9004/files#diff-1d0d1c3863f35bb86ef37975c4e1a2062e6ca71e6f6a94dc385f8a3556284ddcR117) with the `strace` fault-injection feature:
we don't have much else to mitigate this problem. due to the nature of php/nodejs/python applications with many loose files for each container, and the way we pass php files to nginx containers in a shared `emptyDir` volume: https://medium.com/coccoc-engineering-blog/our-journey-to-kubernetes-container-design-principles-is-your-back-bag-9166fc4736d2#957e
we can't do much else to mitigate this problem. due to the nature of php/nodejs/python applications with many loose files for each container, and the way we [pass php files to nginx containers](https://medium.com/coccoc-engineering-blog/our-journey-to-kubernetes-container-design-principles-is-your-back-bag-9166fc4736d2#957e) in a shared `emptyDir` volume.
k8s, we mount some files as `hostPath` volumes into containers, but from time to time host cronjobs write new data into them. for a time, this worked correctly.
k8s, we mount some files as `hostPath` volumes into containers, and let host cronjobs write new data into them. for a time, this worked correctly:
but an update to the cronjob code introduced a new phenomenon. on host, we can see the new data in the file, but k8s pods see only old data.
but an update to the cronjob code introduced a new phenomenon. on host, we can see the new data in the file, but k8s pods read only old data.
`hostPath` is implemented as a bind-mount from host, so it's "translated" to a specific inode once at the pod setup phase. after `mv` rewrote the path to a different inode, `68812816` is kept alive only by the mount namespace. it's similar to a running process holding open a deleted file, giving `DEL` state in `lsof` listings. but this 0-link file is still reachable from host namespace, via the container's `root/` access in `/proc`:
`hostPath` is implemented as a bind-mount, so it's "translated" to a specific inode once at the pod setup phase. after `mv` rewrote the path to a different inode, `68812816` is kept alive only by the mount namespace. it's similar to a running process holding open a deleted file, giving `DEL` state in `lsof` listings. but this 0-link file is still reachable from host, via the container's `root/` under `/proc`:
share the more-stable directory as volume instead. and further, we'll work with people to share the data updates in a more robust way.
share the more-stable directory as volume instead. it would still break the same way if someone renames the directory, but it's less likely. and further, we'll work with people to share the data updates in a more robust way.
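the stale-inode behavior is easy to reproduce without k8s. the sketch below uses the open-file analogy from above instead of a real bind-mount: a reader opens the file once (standing in for the pod), then a writer replaces the path with a rename, as `mv` does. it assumes a linux-like filesystem; `demo` and the file names are made up for illustration.

```python
import os
import tempfile


def demo():
    """Show that a rename-over-path leaves earlier holders on the old inode."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "data.txt")
        with open(path, "w") as f:
            f.write("old\n")
        old_inode = os.stat(path).st_ino

        # the "pod": resolves the path to an inode once and keeps holding it
        reader = open(path)

        # the "cronjob": writes a new file, then renames it over the path
        tmp = os.path.join(d, "data.txt.new")
        with open(tmp, "w") as f:
            f.write("new\n")
        os.replace(tmp, path)

        held = os.fstat(reader.fileno())
        result = {
            "old_inode": old_inode,
            "new_inode": os.stat(path).st_ino,   # what the path points to now
            "held_inode": held.st_ino,           # what the reader still holds
            "held_links": held.st_nlink,         # 0: no name left, kept alive by the holder
            "seen_by_reader": reader.read(),
        }
        reader.close()
        with open(path) as f:
            result["seen_by_path"] = f.read()
    return result
```

the reader keeps seeing `old` while anyone opening the path gets `new`. a writer that truncates and rewrites the file in place would keep the same inode, and files looked up by name inside a shared directory are resolved again on each open, which is why sharing the directory is less fragile.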