hdhoang/homes - Change 5YVR6PDW34SBYSYB6L6AVZUKMGWEUHGDTLFAIQN6XEKEL65SOYUQC


Created by hdhoang on September 7, 2023

5YVR6PDW34SBYSYB6L6AVZUKMGWEUHGDTLFAIQN6XEKEL65SOYUQC

Dependencies

[2] WF6HOUZUSQPAW6ZTYXW5ZLYUJVYEK7LKVDWQBS7ADZGZDCTVSQ4AC

In channels

main

Change contents

Replacement in container-in-linux.md at line 24 [2.285]

B:BD[2.1550] → [2.1550:1730]

impacted just 1 workload. redhat has a nice article
https://access.redhat.com/solutions/29894 explaining the mechanism at
work, and a note explaining why it happened occasionally:

[2.1550]

[2.1730]

impacted just 1 workload. redhat has a [nice
article](https://access.redhat.com/solutions/29894) explaining the
mechanism at work, and a note explaining why it happened occasionally:

Replacement in container-in-linux.md at line 32 [2.285]

B:BD[2.1991] → [2.1991:2174]

`ebpf` tooling
https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md#1-kprobekretprobe-dynamic-tracing-kernel-level,
and the message already has a function-ish hint

[2.1991]

[2.2174]

[`ebpf`
tooling](https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md#1-kprobekretprobe-dynamic-tracing-kernel-level),
and the message contained a function-ish hint

Replacement in container-in-linux.md at line 67 [2.285]

B:BD[2.3084] → [2.3084:3174]

around for any huge directory. here, the app creates temporary files
in a flat directory:

[2.3084]

[2.3174]

around for any huge directory. this app creates temporary files in a
flat directory:

Replacement in container-in-linux.md at line 77 [2.285]

B:BD[2.3336] → [2.3336:3383]

## that's our reported inode! what's in there?

[2.3336]

[2.3383]

## that's our reported inode! what's in there? links=2 shows that there's no subdirectory

Replacement in container-in-linux.md at line 86 [2.285]

B:BD[2.3594] → [2.3594:3794]

were still around in `Terminating` state. their mount overlayfs were
busy removing the un-shared files, including the above giant
directories & their files. that caused ~3hours of high write-iops for

[2.3594]

[2.3794]

were still around in `Terminating` state. linux was busy removing the
files in their overlayfs mount, including the above giant directories
& their files. that caused ~3hours of high write-iops to commit

Replacement in container-in-linux.md at line 90 [2.285]

B:BD[2.3862] → [2.3862:3962]

workload. we had to cordon them during that time, and moved more
important pods out to other nodes.

[2.3862]

[2.3962]

workloads. we had to cordon them during that time, and moved more
important pods out of there.

Replacement in container-in-linux.md at line 113 [2.285]
B:BD[2.5181] → [2.5181:5187]
```
path?
```
[2.5181]
[2.5187]
```
path? but that's now [water under the
bridge](https://youtu.be/4G-YQA_bsOU)
```

Replacement in container-in-linux.md at line 119 [2.285]

B:BD[2.5227] → [2.5227:5310]

segue from the previous issue, for a long time we have been facing this situation:

[2.5227]

[2.5310]

segue from the previous issue, for a long time we have been facing
this situation:

Replacement in container-in-linux.md at line 123 [2.285]

B:BD[2.5378] → [2.5378:5470]

- all of the extra ones are `containerd-shim` processes, with no related running containers

[2.5378]

[2.5470]

- all of the extra ones are `containerd-shim` processes, with no
  related running containers

each shim is small, but the overall buildup causes system-wide
degradation:
- `atop`, our favorite recording tool, starts failing with `Malloc
  failed for compression buffer`
- anything that scales per-process or per-cgroup has more useless work to do

Replacement in container-in-linux.md at line 133 [2.285]

B:BD[2.5471] → [2.5471:5570]

here is an example which started running & finished its work while the above deletion was running:

[2.5471]

[2.5570]

here is an example which started running & finished its work during
the above deletion:

Replacement in container-in-linux.md at line 155 [2.285]

B:BD[2.8327] → [2.8327:8698]

after 10s, from :44:42 to :44:52, containerd gave up on removing the task, and the orphan shim stays around from then. each shim is small, but the overall buildup causes system-wide degradation:
- `atop`, our favorite recording tool, starts failing with `Malloc failed for compression buffer`
- anything that scales per-process or per-cgroup has more useless work to do

[2.8327]

[2.8698]

after 10s, from `ExitedAt` :44:42 to `deadline exceeded` :44:52,
containerd gave up on removing the task, and the orphan shim stays
around from then.

Replacement in container-in-linux.md at line 159 [2.285]

B:BD[2.8699] → [2.8699:8876]

again with ebpf and investigating the flow around log message, we think that discarding/unmounting each container's overlay filesystem are io-intensive as well as io-sensitive.

[2.8699]

[2.8876]

again with ebpf in one hand and the flow around this area in the other,
we think that discarding/unmounting each container's overlay
filesystem is io-intensive as well as io-sensitive.

Replacement in container-in-linux.md at line 210 [2.285]

B:BD[2.12574] → [2.12574:13123]

cloudfoundry discussed this general problem at
https://www.cloudfoundry.org/blog/an-overlayfs-journey-with-the-garden-team/. coccoc
is watching out with great interest for newer containerd 1.6 releases
with fix around handling overlay deletion. some recent ones improved
short-term, temporary mounts by marking them readonly. containerd maintainers also made a great reproduction with `strace` fault-injection feature https://github.com/containerd/containerd/pull/9004/files#diff-1d0d1c3863f35bb86ef37975c4e1a2062e6ca71e6f6a94dc385f8a3556284ddcR117

[2.12574]

[2.13123]

cloudfoundry discussed [this general
problem](https://www.cloudfoundry.org/blog/an-overlayfs-journey-with-the-garden-team/). coccoc
is watching with great interest for newer containerd 1.6 LTS releases
with fixes around handling overlay deletion. some recent ones improved
short-term, temporary mounts by marking them readonly. containerd
maintainers also made a [great
reproduction](https://github.com/containerd/containerd/pull/9004/files#diff-1d0d1c3863f35bb86ef37975c4e1a2062e6ca71e6f6a94dc385f8a3556284ddcR117)
with the `strace` fault-injection feature:

Replacement in container-in-linux.md at line 225 [2.285]

B:BD[2.13370] → [2.13370:13467]

whole-system load, is in design phase for now
https://github.com/containerd/containerd/pull/4785

[2.13370]

[2.13467]

whole-system load, is [in design phase for
now](https://github.com/containerd/containerd/pull/4785).

Replacement in container-in-linux.md at line 228 [2.285]

B:BD[2.13468] → [2.13468:13825]

we don't have much else to mitigate this problem. due to the nature of php/nodejs/python applications with many loose files for each container, and the way we pass php files to nginx containers in a shared `emptyDir` volume:
https://medium.com/coccoc-engineering-blog/our-journey-to-kubernetes-container-design-principles-is-your-back-bag-9166fc4736d2#957e

[2.13468]

[2.13825]

we can't do much else to mitigate this problem, due to the nature of
php/nodejs/python applications with many loose files for each
container, and the way we [pass php files to nginx
containers](https://medium.com/coccoc-engineering-blog/our-journey-to-kubernetes-container-design-principles-is-your-back-bag-9166fc4736d2#957e)
in a shared `emptyDir` volume.

Replacement in container-in-linux.md at line 235 [2.285]
B:BD[2.13826] → [2.13826:13847]
```
hostPath ghost files
```
[2.13826]
[2.13847]
```
ghost hostPath files
```

Replacement in container-in-linux.md at line 239 [2.285]

B:BD[2.13923] → [2.13923:14084]

k8s, we mounts some files as `hostPath` volume into containers, but
from time to time host cronjobs write new data into them. for a time,
this worked correctly.

[2.13923]

[2.14084]

k8s, we mount some files as `hostPath` volumes into containers, and let
host cronjobs write new data into them. for a time, this worked
correctly:

Replacement in container-in-linux.md at line 270 [2.285]

B:BD[2.14494] → [2.14494:14630]

but update to the cronjob code introduced new phenomenon. on host, we
can see the new data in the file, but k8s pods see only old data.

[2.14494]

[2.14630]

but an update to the cronjob code introduced a new phenomenon. on host, we
can see the new data in the file, but k8s pods read only old data.

Insertion in container-in-linux.md at line 276 [2.285]
[2.14713]
[2.14713]
```
host$ cat /tmp/data.txt
version3
```

Replacement in container-in-linux.md at line 299 [2.285]

B:BD[2.15211] → [2.15211:15644]

`hostPath` is implemented as a bind-mount from host, so it's
"translated" to specific inode once at the pod setup phase. after `mv`
rewrote the path to different inode, `68812816` is kept alive only by
the mount namespace. it's similar to a running process holding open a
deleted file, giving `DEL` state in `lsof` listings. but this 0-link
file is still reachable from host namespace, via the container's
`root/` access in `/proc`:

[2.15211]

[2.15644]

`hostPath` is implemented as a bind-mount, so it's "translated" to a
specific inode once at the pod setup phase. after `mv` rewrote the
path to a different inode, `68812816` is kept alive only by the mount
namespace. it's similar to a running process holding open a deleted
file, giving `DEL` state in `lsof` listings. but this 0-link file is
still reachable from host, via the container's `root/` under `/proc`:

Replacement in container-in-linux.md at line 312 [2.285]

B:BD[2.15777] → [2.15777:15912]

share the more-stable directory as volume instead. and further, we'll
work with people to share the data updates in a more robust way.

[2.15777]

[2.15912]

share the more-stable directory as volume instead. it would still
break the same way if someone renames the directory, but it's less
likely. and further, we'll work with people to share the data updates
in a more robust way.
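
the stale-inode behavior behind the `hostPath` issue can be reproduced without k8s. below is a minimal sketch, assuming a linux host: an already-open file handle stands in for the bind-mount (both pin an inode rather than a path), and `os.replace` stands in for the cronjob's `mv`. the file names, the `version1`/`version2` contents and the `demo_stale_inode` helper are made up for illustration, they are not from the real setup.

```python
import os
import tempfile


def demo_stale_inode() -> dict:
    """Show that a handle opened before a rename keeps reading the old inode."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "data.txt")
        with open(path, "w") as f:
            f.write("version1\n")

        # stand-in for the bind-mount: the path is resolved to an inode once, here
        reader = open(path)
        old_inode = os.fstat(reader.fileno()).st_ino

        # stand-in for the cronjob: write a new file, then rename it over the path
        staging = os.path.join(workdir, "data.txt.new")
        with open(staging, "w") as f:
            f.write("version2\n")
        os.replace(staging, path)

        new_inode = os.stat(path).st_ino
        stale = reader.read()  # still served from the old, now unlinked inode
        reader.close()

        with open(path) as f:
            fresh = f.read()  # a new lookup by path reaches the new inode

    return {
        "old_inode": old_inode,
        "new_inode": new_inode,
        "stale": stale,
        "fresh": fresh,
    }


if __name__ == "__main__":
    for key, value in demo_stale_inode().items():
        print(f"{key}: {value!r}")
```

a writer that updates the file in place (truncate and rewrite) keeps the same inode, so existing handles and bind-mounts would see the new data. the trade-off is that readers can then observe a half-written file, which is usually why write-then-rename is used in the first place.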