from there, we could do `kubectl debug` into the container & look around for any huge directory. this app creates temporary files in a flat directory:
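roughly like this (a sketch, not our exact commands; pod and container names are placeholders, and busybox is just a convenient debug image):

```sh
# attach an ephemeral debug container that shares the app container's namespaces
kubectl debug -it mypod --image=busybox:1.36 --target=app -- sh

# inside it, reach the app's filesystem through /proc and look for huge dirs
# (assumes the target's main process is visible as pid 1 here)
du -xk -d1 /proc/1/root/tmp | sort -n | tail
```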
at this point, we made the mistake of rollout-restarting the deployment. new pods started working fine right away, but the old pods were still around in `Terminating` state. linux was busy removing the files in their overlayfs mount, including the above giant directories & their files. that caused ~3 hours of high write-iops to commit filesystem metadata on the nodes, which slowed down other unrelated workloads. we had to cordon them during that time and move the more important pods out of there.
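for reference, the sequence was roughly this (deployment, node and pod names are placeholders):

```sh
kubectl rollout restart deployment/my-app     # new pods come up fine
kubectl get pods -o wide | grep Terminating   # old pods linger while overlayfs is unlinked

# keep new work off the nodes busy with deletes, and reschedule what matters
kubectl cordon node-07
kubectl delete pod important-pod-abc123       # its deployment recreates it on an uncordoned node
```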
the strangest effect was that pods on other nodes also randomly failed readiness checks. it turned out some of their mysqlrouters were on the heavy-load nodes. the db clusters were totally fine; they run on different HW.
i still don't understand how a network-heavy app can be disturbed so much by disk io. perhaps it checkpoints or logs something in the critical path? but that's now [water under the bridge](https://youtu.be/4G-YQA_bsOU)
after 10s, from `ExitedAt` :44:42 to `deadline exceeded` :44:52, containerd gave up on removing the task, and the orphaned shim stays around from then on.
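a hedged way to spot these on a node: compare long-lived shim processes against the tasks containerd still tracks; a shim with no matching task is the leaked one.

```sh
# shim processes still running on the node, with their age in seconds
ps -eo pid,etimes,args | grep [c]ontainerd-shim

# tasks containerd still knows about in the kubernetes namespace
ctr --namespace k8s.io tasks list
```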
again with ebpf in one hand and the flow around this area in the other, we think that discarding/unmounting each container's overlay filesystem is io-intensive as well as io-sensitive.
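for anyone who wants to reproduce the measurement, a minimal bpftrace sketch (assuming the umount syscall tracepoints exist on the node's kernel) that histograms how long each unmount takes during pod teardown:

```sh
bpftrace -e '
tracepoint:syscalls:sys_enter_umount { @start[tid] = nsecs; }
tracepoint:syscalls:sys_exit_umount /@start[tid]/ {
  @umount_us = hist((nsecs - @start[tid]) / 1000);
  delete(@start[tid]);
}'
```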
cloudfoundry discussed [this general problem](https://www.cloudfoundry.org/blog/an-overlayfs-journey-with-the-garden-team/). coccoc is following newer containerd 1.6 LTS releases with great interest, for fixes around overlay deletion handling. some recent ones improved short-term, temporary mounts by marking them readonly. containerd maintainers also made a [great reproduction](https://github.com/containerd/containerd/pull/9004/files#diff-1d0d1c3863f35bb86ef37975c4e1a2062e6ca71e6f6a94dc385f8a3556284ddcR117) with `strace`'s fault-injection feature:
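not their exact repro, but the feature itself is easy to try; this sketch injects EBUSY into every unlinkat() so `rm -rf` behaves like it's fighting a filesystem that refuses deletes:

```sh
mkdir -p /tmp/fakelayer/a/b && touch /tmp/fakelayer/a/b/f
strace -f -e trace=unlinkat -e inject=unlinkat:error=EBUSY rm -rf /tmp/fakelayer
```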
we can't do much else to mitigate this problem, due to the nature of php/nodejs/python applications with many loose files for each container, and the way we [pass php files to nginx containers](https://medium.com/coccoc-engineering-blog/our-journey-to-kubernetes-container-design-principles-is-your-back-bag-9166fc4736d2#957e) in a shared `emptyDir` volume.
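the sharing pattern looks roughly like this (a sketch; image, volume and container names are made up, not our real manifests):

```sh
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  volumes:
  - name: webroot
    emptyDir: {}            # shared scratch space, deleted with the pod
  initContainers:
  - name: code              # copies the php tree into the shared volume
    image: my-php-app:latest
    command: ["cp", "-r", "/app/.", "/webroot/"]
    volumeMounts:
    - name: webroot
      mountPath: /webroot
  containers:
  - name: php-fpm
    image: php:8-fpm
    volumeMounts:
    - name: webroot
      mountPath: /var/www/html
  - name: nginx
    image: nginx:stable
    volumeMounts:
    - name: webroot
      mountPath: /var/www/html
      readOnly: true
EOF
```

every such pod tears down thousands of loose files with its `emptyDir`, which is exactly the deletion load described above.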
a more fundamental fix, using overlayfs `volatile` mode to alleviate whole-system load, is [in design phase for now](https://github.com/containerd/containerd/pull/4785).
onward to the main title. as part of migrating on-host applications to k8s, we mount some files as `hostPath` volumes into containers, and let host cronjobs write new data into them. for a time, this worked correctly:
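a minimal sketch of the setup (file path, image and names are illustrative, not our real config):

```sh
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: consumer
spec:
  volumes:
  - name: feed
    hostPath:
      path: /var/data/feed.dat   # a host cronjob periodically rewrites this file
      type: File
  containers:
  - name: app
    image: my-app:latest
    volumeMounts:
    - name: feed
      mountPath: /data/feed.dat
EOF
```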
but an update to the cronjob code introduced a new phenomenon. on the host, we can see the new data in the file, but k8s pods read only the old data.
`hostPath` is implemented as a bind-mount, so it's "translated" to a specific inode once, at the pod setup phase. after `mv` rewrote the path to a different inode, `68812816` was kept alive only by the mount namespace. it's similar to a running process holding a deleted file open, giving the `DEL` state in `lsof` listings. but this 0-link file is still reachable from the host, via the container's `root/` under `/proc`:
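the whole thing reproduces outside k8s with a plain bind mount (a sketch; run as root, paths made up):

```sh
echo v1 > /tmp/feed.dat
mkdir -p /tmp/ctr && touch /tmp/ctr/feed.dat
mount --bind /tmp/feed.dat /tmp/ctr/feed.dat   # what hostPath does at pod setup

# the cronjob's atomic-update idiom: write a new file, mv it over the old path
echo v2 > /tmp/feed.dat.new
mv /tmp/feed.dat.new /tmp/feed.dat             # a new inode now owns the path

cat /tmp/feed.dat      # v2: the host resolves the path to the new inode
cat /tmp/ctr/feed.dat  # v1: the bind mount still pins the old inode
```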
our mitigation for this one was moving the `hostPath` up a level, to share the more-stable parent directory as the volume instead. it would still break the same way if someone renamed the directory, but that's less likely. going further, we'll work with people to share these data updates in a more robust way.