
spurious 500 in libpod/containers/archive (passing bulk input to subprocess: write |1: broken pipe) #6573

@sqwishy


Issue Description

PUTting a tar into a container (PUT /v6.0.0/libpod/containers/derp/archive?path=/) can fail with {"cause":"broken pipe","message":"passing bulk input to subprocess: write |1: broken pipe","response":500}.

I am using podman, not buildah, but I am reporting this here because I think the bug is in github.com/containers/buildah/copier/copier.go:copierWithSubprocess. For that reason, the output of buildah version and buildah info below comes from podman version and podman info instead. (Sorry if this is misplaced.)

tldr

I'm inserting this real quick because I am worried the first 85% of this report is way too much context and not that interesting to most readers.

In copierHandlerPut(), the tar is read from a pipe using tar.NewReader(bulkReader), until the tar.Reader's Next() returns io.EOF. But that io.EOF is reported before the underlying pipe actually reaches EOF: the pipe can still be read, yet the subprocess exits without ever getting a read() that returns zero. This causes io.Copy(bulkReaderWrite, bulkReader) in the sender to fail a write with EPIPE/broken pipe.

This happens because tar files carry many trailing null bytes and tar.Reader does not consume them all. The failure is spurious because the pipe between the two processes has an internal buffer that is usually large enough for the writer to fit all the trailing null bytes before the reader quits. But in my case I have a weird environment where the pipe buffer is one quarter the usual size, so this failure happens very easily.
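The core behavior is visible with the standard library alone. This is a sketch I wrote to illustrate the point, not code from buildah: build a one-file tar in memory, append extra zero padding the way GNU tar's blocking factor does, read it with tar.Reader until io.EOF, and see how much is left unread.

```go
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
	"io"
)

// unreadAfterTarEOF builds a one-file tar in memory, appends `extra` zero
// bytes of trailing padding, reads it with tar.Reader until io.EOF, and
// returns how many bytes were left unread in the underlying stream.
func unreadAfterTarEOF(extra int) int {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	body := []byte("hello")
	tw.WriteHeader(&tar.Header{Name: "f", Mode: 0o600, Size: int64(len(body))})
	tw.Write(body)
	tw.Close() // writes the two 512-byte zero blocks that mark end-of-archive

	// GNU tar pads archives out to a blocking factor (20 blocks = 10240
	// bytes by default), so real archives often end with thousands of
	// extra zero bytes beyond the two-block end marker.
	buf.Write(make([]byte, extra))

	tr := tar.NewReader(&buf)
	for {
		if _, err := tr.Next(); err == io.EOF {
			break
		} else if err != nil {
			panic(err)
		}
		io.Copy(io.Discard, tr)
	}
	// tar.Reader stopped right after the end-of-archive marker; the
	// padding is still sitting unread in the stream.
	return buf.Len()
}

func main() {
	fmt.Println("bytes left unread:", unreadAfterTarEOF(6144))
}
```

With a bytes.Buffer nothing bad happens, but when the underlying reader is a pipe and the other end is still trying to write that padding, exiting at this point is what turns into the EPIPE.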

Steps to reproduce the issue

It's very hard to reproduce this on purpose on a normal system. (Maybe you can help?)

I have a systemd-nspawn machine in which the non-root user, when creating a pipe with pipe2, gets a pipe with an internal buffer of only 8192 bytes. I don't know why. For root on the nspawn machine, and for every user on every other system I've checked, it's 65536. But this issue, which I believe is a race condition, happens frequently when the pipe is smol like that.

Below are the steps to reproduce this for that regular user in the systemd-nspawn machine with the funny pipe.

podman container create --name derp alpine:latest
systemctl --user start podman.socket
curl -v -X PUT \
   --unix-socket /run/user/1000/podman/podman.sock \
   -H content-type:application/x-tar \
   --data-binary @28238376592976.tar \
   http://p/v6.0.0/libpod/containers/derp/archive?path=/

The image or tar doesn't really matter.

Describe the results you received

If I run those steps, the response I get is

> PUT /v6.0.0/libpod/containers/derp/archive?path=/ HTTP/1.1
> Host: p
> User-Agent: curl/8.15.0
> Accept: */*
> content-type:application/x-tar
> Content-Length: 655360
>
* upload completely sent off: 655360 bytes
< HTTP/1.1 500 Internal Server Error
< Api-Version: 1.41
< Content-Type: application/json
< Libpod-Api-Version: 5.7.0
< Server: Libpod/5.7.0 (linux)
< X-Reference-Id: 0xc0000ba008
< Date: Fri, 05 Dec 2025 19:12:36 GMT
< Content-Length: 107
<
{"cause":"broken pipe","message":"passing bulk input to subprocess: write |1: broken pipe","response":500}

And in my journal, podman writes something like this

podman[546]: time="2025-12-05T11:15:02-08:00" level=error msg="passing bulk input to subprocess: write |1: broken pipe"
podman[546]: time="2025-12-05T11:15:02-08:00" level=info msg="Request Failed(Internal Server Error): passing bulk input to subprocess: write |1: broken pipe"
podman[546]: @ - - [05/Dec/2025:11:15:01 -0800] "PUT /v6.0.0/libpod/containers/derp/archive?path=/ HTTP/1.1" 500 107 "" "curl/8.15.0"

Describe the results you expected

A successful response code.

buildah version output

Client:        Podman Engine
Version:       5.7.0
API Version:   5.7.0
Go Version:    go1.25.4 X:nodwarf5
Git Commit:    0370128fc8dcae93533334324ef838db8f8da8cb
Built:         Mon Nov 10 16:00:00 2025
Build Origin:  Fedora Project
OS/Arch:       linux/amd64

buildah info output

host:
  arch: amd64
  buildahVersion: 1.42.0
  cgroupControllers:
  - cpu
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.13-2.fc43.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.13, commit: '
  cpuUtilization:
    idlePercent: 98.85
    systemPercent: 0.39
    userPercent: 0.76
  cpus: 4
  databaseBackend: sqlite
  distribution:
    distribution: fedora
    variant: container
    version: "43"
  eventLogger: journald
  freeLocks: 2048
  hostname: materialist
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 6.15.6-200.fc42.x86_64
  linkmode: dynamic
  logDriver: journald
  memFree: 1271652352
  memTotal: 16625729536
  networkBackend: netavark
  networkBackendInfo:
    backend: netavark
    dns:
      package: aardvark-dns-1.17.0-1.fc43.x86_64
      path: /usr/libexec/podman/aardvark-dns
      version: aardvark-dns 1.17.0
    package: netavark-1.17.0-1.fc43.x86_64
    path: /usr/libexec/podman/netavark
    version: netavark 1.17.0
  ociRuntime:
    name: crun
    package: crun-1.25.1-1.fc43.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.25.1
      commit: 156ae065d4a322d149c7307034f98d9637aa92a2
      rundir: /run/user/0/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +LIBKRUN +WASM:wasmedge +YAJL
  os: linux
  pasta:
    executable: /usr/bin/pasta
    package: passt-0^20250919.g623dbf6-1.fc43.x86_64
    version: |
      pasta 0^20250919.g623dbf6-1.fc43.x86_64
      Copyright Red Hat
      GNU General Public License, version 2 or later
        <https://www.gnu.org/licenses/old-licenses/gpl-2.0.html>
      This is free software: you are free to change and redistribute it.
      There is NO WARRANTY, to the extent permitted by law.
  remoteSocket:
    exists: true
    path: /run/podman/podman.sock
  rootlessNetworkCmd: pasta
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: ""
    package: ""
    version: ""
  swapFree: 12444422144
  swapTotal: 12884893696
  uptime: 1265h 25m 57.00s (Approximately 52.71 days)
  variant: ""
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - docker.io
store:
  configFile: /usr/share/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.additionalImageStores:
    - /usr/lib/containers/storage
    overlay.imagestore: /usr/lib/containers/storage
    overlay.mountopt: nodev,metacopy=on
  graphRoot: /var/lib/containers/storage
  graphRootAllocated: 966288998400
  graphRootUsed: 793064759296
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Supports shifting: "false"
    Supports volatile: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 0
  runRoot: /run/containers/storage
  transientStore: false
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 5.7.0
  BuildOrigin: Fedora Project
  Built: 1762819200
  BuiltTime: Mon Nov 10 16:00:00 2025
  GitCommit: 0370128fc8dcae93533334324ef838db8f8da8cb
  GoVersion: go1.25.4 X:nodwarf5
  Os: linux
  OsArch: linux/amd64
  Version: 5.7.0

Provide your storage.conf

$ cat $XDG_CONFIG_HOME/containers/storage.conf
cat: /containers/storage.conf: No such file or directory

$ cat /etc/containers/storage.conf
cat: /etc/containers/storage.conf: No such file or directory

$ cat /usr/share/containers/storage.conf
# This file is the configuration file for all tools
# that use the containers/storage library. The storage.conf file
# overrides all other storage.conf files. Container engines using the
# container/storage library do not inherit fields from other storage.conf
# files.
#
#  Note: The storage.conf file overrides other storage.conf files based on this precedence:
#      /usr/containers/storage.conf
#      /etc/containers/storage.conf
#      $HOME/.config/containers/storage.conf
#      $XDG_CONFIG_HOME/containers/storage.conf (If XDG_CONFIG_HOME is set)
# See man 5 containers-storage.conf for more information
# The "container storage" table contains all of the server options.
[storage]

# Default Storage Driver, Must be set for proper operation.
driver = "overlay"

# Temporary storage location
runroot = "/run/containers/storage"

# Primary Read/Write location of container storage
# When changing the graphroot location on an SELINUX system, you must
# ensure  the labeling matches the default locations labels with the
# following commands:
# semanage fcontext -a -e /var/lib/containers/storage /NEWSTORAGEPATH
# restorecon -R -v /NEWSTORAGEPATH
graphroot = "/var/lib/containers/storage"

# Optional alternate location of image store if a location separate from the
# container store is required. If set, it must be different than graphroot.
# imagestore = ""


# Storage path for rootless users
#
# rootless_storage_path = "$HOME/.local/share/containers/storage"

# Transient store mode makes all container metadata be saved in temporary storage
# (i.e. runroot above). This is faster, but doesn't persist across reboots.
# Additional garbage collection must also be performed at boot-time, so this
# option should remain disabled in most configurations.
# transient_store = true

[storage.options]
# Storage options to be passed to underlying storage drivers

# AdditionalImageStores is used to pass paths to additional Read/Only image stores
# Must be comma separated list.
additionalimagestores = [
"/usr/lib/containers/storage",
]

# Allows specification of how storage is populated when pulling images. This
# option can speed the pulling process of images compressed with format
# zstd:chunked. Containers/storage looks for files within images that are being
# pulled from a container registry that were previously pulled to the host.  It
# can copy or create a hard link to the existing file when it finds them,
# eliminating the need to pull them from the container registry. These options
# can deduplicate pulling of content, disk storage of content and can allow the
# kernel to use less memory when running containers.

# containers/storage supports four keys
#   * enable_partial_images="true" | "false"
#     Tells containers/storage to look for files previously pulled in storage
#     rather then always pulling them from the container registry.
#   * use_hard_links = "false" | "true"
#     Tells containers/storage to use hard links rather then create new files in
#     the image, if an identical file already existed in storage.
#   * ostree_repos = ""
#     Tells containers/storage where an ostree repository exists that might have
#     previously pulled content which can be used when attempting to avoid
#     pulling content from the container registry
#   * convert_images = "false" | "true"
#     If set to true, containers/storage will convert images to a
#     format compatible with partial pulls in order to take advantage
#     of local deduplication and hard linking.  It is an expensive
#     operation so it is not enabled by default.
pull_options = {enable_partial_images = "true", use_hard_links = "false", ostree_repos=""}

# Remap-UIDs/GIDs is the mapping from UIDs/GIDs as they should appear inside of
# a container, to the UIDs/GIDs as they should appear outside of the container,
# and the length of the range of UIDs/GIDs.  Additional mapped sets can be
# listed and will be heeded by libraries, but there are limits to the number of
# mappings which the kernel will allow when you later attempt to run a
# container.
#
# remap-uids = "0:1668442479:65536"
# remap-gids = "0:1668442479:65536"

# Remap-User/Group is a user name which can be used to look up one or more UID/GID
# ranges in the /etc/subuid or /etc/subgid file.  Mappings are set up starting
# with an in-container ID of 0 and then a host-level ID taken from the lowest
# range that matches the specified name, and using the length of that range.
# Additional ranges are then assigned, using the ranges which specify the
# lowest host-level IDs first, to the lowest not-yet-mapped in-container ID,
# until all of the entries have been used for maps. This setting overrides the
# Remap-UIDs/GIDs setting.
#
# remap-user = "containers"
# remap-group = "containers"

# Root-auto-userns-user is a user name which can be used to look up one or more UID/GID
# ranges in the /etc/subuid and /etc/subgid file.  These ranges will be partitioned
# to containers configured to create automatically a user namespace.  Containers
# configured to automatically create a user namespace can still overlap with containers
# having an explicit mapping set.
# This setting is ignored when running as rootless.
# root-auto-userns-user = "storage"
#
# Auto-userns-min-size is the minimum size for a user namespace created automatically.
# auto-userns-min-size=1024
#
# Auto-userns-max-size is the maximum size for a user namespace created automatically.
# auto-userns-max-size=65536

[storage.options.overlay]
# ignore_chown_errors can be set to allow a non privileged user running with
# a single UID within a user namespace to run containers. The user can pull
# and use any image even those with multiple uids.  Note multiple UIDs will be
# squashed down to the default uid in the container.  These images will have no
# separation between the users in the container. Only supported for the overlay
# and vfs drivers.
#ignore_chown_errors = "false"

# Inodes is used to set a maximum inodes of the container image.
# inodes = ""

# Path to an helper program to use for mounting the file system instead of mounting it
# directly.
#mount_program = "/usr/bin/fuse-overlayfs"

# mountopt specifies comma separated list of extra mount options
mountopt = "nodev,metacopy=on"

# Set to skip a PRIVATE bind mount on the storage home directory.
# skip_mount_home = "false"

# Set to use composefs to mount data layers with overlay.
# use_composefs = "false"

# Size is used to set a maximum size of the container image.
# size = ""

# ForceMask specifies the permissions mask that is used for new files and
# directories.
#
# The values "shared" and "private" are accepted.
# Octal permission masks are also accepted.
#
#  "": No value specified.
#     All files/directories, get set with the permissions identified within the
#     image.
#  "private": it is equivalent to 0700.
#     All files/directories get set with 0700 permissions.  The owner has rwx
#     access to the files. No other users on the system can access the files.
#     This setting could be used with networked based homedirs.
#  "shared": it is equivalent to 0755.
#     The owner has rwx access to the files and everyone else can read, access
#     and execute them. This setting is useful for sharing containers storage
#     with other users.  For instance have a storage owned by root but shared
#     to rootless users as an additional store.
#     NOTE:  All files within the image are made readable and executable by any
#     user on the system. Even /etc/shadow within your image is now readable by
#     any user.
#
#   OCTAL: Users can experiment with other OCTAL Permissions.
#
#  Note: The force_mask Flag is an experimental feature, it could change in the
#  future.  When "force_mask" is set the original permission mask is stored in
#  the "user.containers.override_stat" xattr and the "mount_program" option must
#  be specified. Mount programs like "/usr/bin/fuse-overlayfs" present the
#  extended attribute permissions to processes within containers rather than the
#  "force_mask"  permissions.
#
# force_mask = ""

Upstream Latest Release

Yes

Additional environment details

my .nspawn file

[Exec]
PrivateUsers=990969856:131072
LinkJournal=try-host

and systemctl cat systemd-nspawn@materialist.service

# /usr/lib/systemd/system/systemd-nspawn@.service
#  SPDX-License-Identifier: LGPL-2.1-or-later
#
#  This file is part of systemd.
#
#  systemd is free software; you can redistribute it and/or modify it
#  under the terms of the GNU Lesser General Public License as published by
#  the Free Software Foundation; either version 2.1 of the License, or
#  (at your option) any later version.

[Unit]
Description=Container %i
Documentation=man:systemd-nspawn(1)
Wants=modprobe@tun.service modprobe@loop.service modprobe@dm_mod.service
PartOf=machines.target
Before=machines.target
After=network.target modprobe@tun.service modprobe@loop.service modprobe@dm_mod.service
RequiresMountsFor=/var/lib/machines/%i

[Service]
# Make sure the DeviceAllow= lines below can properly resolve the 'block-loop' expression (and others)
ExecStart=systemd-nspawn --quiet --keep-unit --boot --link-journal=try-guest --network-veth -U --settings=override --machine=%i
KillMode=mixed
Type=notify
RestartForceExitStatus=133
SuccessExitStatus=133
Slice=machine.slice
Delegate=yes
DelegateSubgroup=supervisor
CoredumpReceive=yes
TasksMax=16384


DevicePolicy=closed
DeviceAllow=/dev/net/tun rwm
DeviceAllow=char-pts rw
DeviceAllow=/dev/fuse rwm

# nspawn itself needs access to /dev/loop-control and /dev/loop, to implement
# the --image= option. Add these here, too.
DeviceAllow=/dev/loop-control rw
DeviceAllow=block-loop rw
DeviceAllow=block-blkext rw

# nspawn can set up LUKS encrypted loopback files, in which case it needs
# access to /dev/mapper/control and the block devices /dev/mapper/*.
DeviceAllow=/dev/mapper/control rw
DeviceAllow=block-device-mapper rw

[Install]
WantedBy=machines.target

# /usr/lib/systemd/system/service.d/10-timeout-abort.conf
# This file is part of the systemd package.
# See https://fedoraproject.org/wiki/Changes/Shorter_Shutdown_Timer.
#
# To facilitate debugging when a service fails to stop cleanly,
# TimeoutStopFailureMode=abort is set to "crash" services that fail to stop in
# the time allotted. This will cause the service to be terminated with SIGABRT
# and a coredump to be generated.
#
# To undo this configuration change, create a mask file:
#   sudo mkdir -p /etc/systemd/system/service.d
#   sudo ln -sv /dev/null /etc/systemd/system/service.d/10-timeout-abort.conf

[Service]
TimeoutStopFailureMode=abort

# /usr/lib/systemd/system/service.d/50-keep-warm.conf
# Disable freezing of user sessions to work around kernel bugs.
# See https://bugzilla.redhat.com/show_bug.cgi?id=2321268
[Service]
Environment=SYSTEMD_SLEEP_FREEZE_USER_SESSIONS=0

# /etc/systemd/system/systemd-nspawn@materialist.service.d/override.conf
[Service]
Environment=SYSTEMD_SECCOMP=0
Slice=materialist.slice

MemoryAccounting=yes
MemoryHigh=2G
MemoryMax=4G
# MemorySwapMax=500M

CPUAccounting=true
# 2 cores
CPUQuota=200%

Additional information

The tar being sent in the request is, as far as I can tell, a totally valid tar created with the tar program I installed from my package manager. It has about six thousand trailing null bytes. Other tar files I have that fail randomly have anywhere from roughly one thousand to nine thousand trailing null bytes. That seems like a normal thing to me, I guess.
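For anyone who wants to check their own archives, this is a throwaway helper I used to count the trailing zeros, nothing from podman or buildah:

```go
package main

import (
	"bytes"
	"fmt"
	"os"
)

// countTrailingZeros reports how many zero bytes the given data ends with.
func countTrailingZeros(data []byte) int {
	return len(data) - len(bytes.TrimRight(data, "\x00"))
}

func main() {
	// Usage: pass a tar file path as the first argument.
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		panic(err)
	}
	fmt.Println(countTrailingZeros(data), "trailing zero bytes")
}
```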

From what I can tell from the buildah copier code, a subprocess is created and the tar is sent from the HTTP request to the subprocess through a pipe. In copierHandlerPut(), the tar is read from the pipe using tar.NewReader(bulkReader) until the tar reader returns io.EOF. But it seems io.EOF comes from the tar.Reader before the pipe itself is at EOF.

Here is an strace of the process as it reads and writes the last file of the tar. Some strings were elided and I omitted calls to fchmodat/fchownat for brevity.

981998 openat(AT_FDCWD, "/baro/mod/28238376592976/filelist.xml", ... ) = 5
...
981998 read(3, "\357\273\277<?xml ... >", 384) = 384
981998 write(5, "\357\273\277<?xml ... >", 384) = 384
981998 close(5)                         = 0
...
981998 read(3, "\0\0\0\0 ... \0\0\0\0", 128) = 128
981998 read(3, "\0\0\0\0 ... \0\0\0\0", 512) = 512
981998 read(3, "\0\0\0\0 ... \0\0\0\0", 512) = 512
...
981998 read(0, "{\"Request\":\"QUIT\", ... }\n", 1535) = 1136
981998 exit_group(0 <unfinished ...>
982002 <... futex resumed>)             = ?
982001 <... futex resumed>)             = ?
982000 <... futex resumed>)             = ?
981999 <... futex resumed>)             = ?
981998 <... exit_group resumed>)        = ?
981997 <... epoll_pwait resumed> <unfinished ...>) = ?
981996 <... nanosleep resumed> <unfinished ...>) = ?
981995 <... futex resumed>)             = ?
982002 +++ exited with 0 +++
982001 +++ exited with 0 +++
982000 +++ exited with 0 +++
981999 +++ exited with 0 +++
981998 +++ exited with 0 +++

The tar is read on fd 3; two 512-byte blocks of null bytes are read from that pipe before quitting. There is no read() that returns zero before the process exits. But that tar has some 6272 trailing null bytes in it.

The _, readError = io.Copy(bulkReaderWrite, bulkReader) in the sender has not necessarily returned by then. Most of the time, when the pipe's internal buffer is 65536 bytes instead of 8192, there is enough room in the pipe for the writer to park all the null bytes that the subprocess will never read. In my case, where the pipe is only 8192 bytes, it's very easy for the reader to read those two 512-null-byte chunks, quit, and close its end of the pipe while the writer is still waiting or polling for the pipe to become writable, before the writer can put the rest of the null bytes into the buffer. So the writer writes to the broken pipe and fails.

These lines, from the same trace as above, show reads and writes of 8192 bytes during this time.

980452 write(27, ""..., 32768 <unfinished ...>
981998 <... read resumed>"", 32768) = 8192
980452 <... write resumed>)             = 8192

Whereas traces on every other system I've checked typically show reads/writes of 32768 bytes at a time.

In my nspawn machine, if I make a pipe in Python, I can clearly see the buffer size is limited for the regular user only.

[root@materialist ~]# python3 -c 'import os; r, w = os.pipe2(os.O_NONBLOCK); print(os.write(w, b"\x00" * 131072))'
65536
[user@materialist ~]$ python3 -c 'import os; r, w = os.pipe2(os.O_NONBLOCK); print(os.write(w, b"\x00" * 131072))'
8192

Both root and regular users on the host running nspawn, and on my other computer, all show a buffer size of 65536. I don't know why this is happening in that one case, but it seems to be the cause.

I do believe this is a race condition, though. The writer is often faster than the reader, and the larger buffer makes failures rare, but I have noticed spurious failures on my normal machine too. Looking back at my logs now, I see the same message.

podman[3480460]: time="2025-11-23T18:17:06-08:00" level=error msg="passing bulk input to subprocess: write |1: broken pipe"
podman[3480460]: time="2025-11-23T18:17:06-08:00" level=info msg="Request Failed(Internal Server Error): passing bulk input to subprocess: write |1: broken pipe"
podman[3480460]: @ - - [23/Nov/2025:18:17:06 -0800] "PUT /v6.0.0/libpod/containers/8c2f2cd6b65199f34b62ab16e5dfccd00d3c33ccabadfdf80d0762d0ffcafbb0/archive?path=/baro/mod/28222139786160 HTTP/1.1" 500 107 "" "..."

Unfortunately, I can't reproduce it on purpose except on that one nspawn machine with the itty bitty pipe.

And I'm not sure why the pipe comes out that way. Kernel settings like fs.pipe-user-pages-hard and fs.pipe-user-pages-soft are the same on both systems. And modifying or dropping capabilities like sys_admin and sys_resource didn't seem to do anything. So I don't know how you could modify your own system to make it weird like mine and reproduce this easily.

I also noticed in the man page for tar the following option:

 -i, --ignore-zeros
        Ignore  zeroed  blocks  in archive.  Normally two consecutive 512-blocks filled with zeroes mean
        EOF and tar stops reading after encountering them.  This option instructs it to read further and
        is useful when reading archives created with the -A option.

That matches the two reads the Go tar.Reader does before returning EOF. So it seems this behavior is by design, the tar files I'm sending in the request are normal and not weird, and maybe the copier should read the rest of the pipe until EOF, or something like that, before quitting?
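To illustrate what I mean by reading the rest of the pipe: below is a self-contained sketch (my own, not the actual copier code) where the writer streams a tar plus a large run of trailing padding through a real pipe, and the reader drains the pipe after the tar reader reports io.EOF instead of closing right away, so the blocked writer finishes cleanly rather than hitting EPIPE.

```go
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
	"io"
	"os"
)

// drainAfterTarEOF simulates both sides: a writer goroutine streams a tar
// plus heavy trailing padding through a pipe, and the reader keeps reading
// past the tar end-of-archive marker before closing its end. It returns
// the writer's io.Copy error, which should be nil.
func drainAfterTarEOF() error {
	r, w, err := os.Pipe()
	if err != nil {
		return err
	}

	done := make(chan error, 1)
	go func() {
		defer w.Close()
		var buf bytes.Buffer
		tw := tar.NewWriter(&buf)
		tw.WriteHeader(&tar.Header{Name: "f", Mode: 0o600, Size: 0})
		tw.Close()
		// Padding larger than any default pipe buffer, so the writer
		// is guaranteed to block until the reader makes room.
		buf.Write(make([]byte, 256*1024))
		_, err := io.Copy(w, &buf)
		done <- err
	}()

	// Read entries until the tar reader reports EOF...
	tr := tar.NewReader(r)
	for {
		if _, err := tr.Next(); err == io.EOF {
			break
		} else if err != nil {
			return err
		}
		io.Copy(io.Discard, tr)
	}
	// ...then drain whatever padding remains before closing, so the
	// writer's pending write never sees a broken pipe.
	io.Copy(io.Discard, r)
	r.Close()

	return <-done
}

func main() {
	fmt.Println("writer error:", drainAfterTarEOF())
}
```

If the io.Copy(io.Discard, r) line is removed and r.Close() runs while the writer is still blocked on the padding, the writer's copy fails with a broken pipe, which is exactly the shape of the failure I'm seeing.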
