• mina86.com

  • Will the real ARG_MAX please stand up? Part 2

    Posted by Michał ‘mina86’ Nazarewicz on 18th of April 2021

    In part one we looked at the ARG_MAX parameter on Linux-based systems. We established experimentally how it affects arguments passed to programs and what influences its value. This time, we’ll look directly at the source to verify our findings and see how the limit looks from the point of view of the system libraries and the kernel itself.

    C system library

    Applications get the value of the ARG_MAX parameter from the sysconf function. It’s what the getconf utility uses to report the limit. But even though the result of the function is closely related to the kernel, looking for its definition in the Linux source code is an exercise in futility. Rather, the function is defined in the C system library which, in GNU/Linux distributions, is commonly provided by the glibc package.
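
    As a minimal sketch of what such a query looks like from an application, the snippet below wraps the sysconf call. The helper names query_arg_max and print_arg_max are made up for this illustration; only sysconf and _SC_ARG_MAX come from the system library.

```c
#include <stdio.h>
#include <unistd.h>

/* Returns the ARG_MAX value reported by the C library, or -1 if the
 * library treats the limit as indeterminate. */
long query_arg_max(void)
{
    return sysconf(_SC_ARG_MAX);
}

/* Prints the value, in the same spirit as running `getconf ARG_MAX`. */
void print_arg_max(void)
{
    long value = query_arg_max();

    if (value < 0)
        puts("undefined");
    else
        printf("%ld\n", value);
}
```

    Calling print_arg_max from main should print the same number as getconf ARG_MAX does in the same shell, since both go through the same library function.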

    glibc is a cross-platform library which supports many kernels and architectures. It often includes multiple definitions of the same function, each tailored for a particular platform. Such is the case with sysconf. Thankfully, our analysis is limited to Linux and in glibc 2.33, the implementation we’re interested in is located in the sysdeps/unix/sysv/linux/sysconf.c file and looks as follows:

    #define legacy_ARG_MAX 131072
    
    /* […] */
    
    long int
    __sysconf (int name)
    {
      const char *procfname = NULL;
    
      switch (name)
        {
          /* […] */
    
        case _SC_ARG_MAX:
          {
            struct rlimit rlimit;
            /* Use getrlimit to get the stack limit.  */
            if (__getrlimit (RLIMIT_STACK, &rlimit) == 0)
              return MAX (legacy_ARG_MAX, rlimit.rlim_cur / 4);
    
            return legacy_ARG_MAX;
          }
    
          /* […] */
        }
    
      return posix_sysconf (name);
    }

    This code explains the discrepancies we’ve observed when testing large stack size limits. While glibc implements the 128 KiB lower bound, it’s unaware of the 6 MiB upper bound. Since the getconf utility relies on the sysconf library function, having the above implementation means that for large stacks the tool will wrongly report ARG_MAX as a quarter of the maximum stack size.
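
    To make the discrepancy concrete, here is a small sketch that mirrors the glibc 2.33 formula next to a variant which also applies the 6 MiB upper bound. Both function names are hypothetical and the second function is not glibc code; it only illustrates what a bound-aware calculation would return.

```c
#define KIB 1024ULL
#define MIB (1024ULL * KIB)

/* Mirrors the _SC_ARG_MAX branch of glibc 2.33 quoted above: a quarter
 * of the soft stack limit, but never less than 128 KiB. */
unsigned long long glibc_2_33_arg_max(unsigned long long stack_cur)
{
    unsigned long long quarter = stack_cur / 4;

    return quarter > 128 * KIB ? quarter : 128 * KIB;
}

/* Hypothetical variant which additionally applies the 6 MiB upper
 * bound discussed in the text. */
unsigned long long capped_arg_max(unsigned long long stack_cur)
{
    unsigned long long value = glibc_2_33_arg_max(stack_cur);

    return value < 6 * MIB ? value : 6 * MIB;
}
```

    With the default 8 MiB stack both functions agree on 2 MiB. With a 64 MiB stack the first returns 16 MiB while the second returns 6 MiB, which is the gap described above.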

    glibc isn’t the only library used on Linux systems. Others have their own sysconf implementations which may return different values. uClibc-ng 1.0.38 behaves the same way glibc does while bionic 10.0, dietlibc 0.34 and musl 1.2 return 128 KiB as ARG_MAX.

    The good news is that the situation with glibc is about to improve. My recent commit updating the sysconf implementation makes the function aware of the 6 MiB upper bound. Once the change finds its way into a glibc release, the library will report ARG_MAX correctly even for large stacks.

    Linux kernel

    On the kernel side, we want to look at the execve system call. Like all other system calls, it is defined using a SYSCALL_DEFINEn macro. It doesn’t take long to find its implementation in the fs/exec.c file. In Linux 5.11.11 it looks as follows:

    SYSCALL_DEFINE3(execve,
    		const char __user *, filename,
    		const char __user *const __user *, argv,
    		const char __user *const __user *, envp)
    {
    	return do_execve(getname(filename), argv, envp);
    }

    The definition of do_execve can be found a few lines earlier in the same file. All it does is call the do_execveat_common function, so that’s what we’re going to take a closer look at. It is where most of the checks and calculations happen:

    static int do_execveat_common(int fd, struct filename *filename,
    			      struct user_arg_ptr argv,
    			      struct user_arg_ptr envp,
    			      int flags)
    {
    	struct linux_binprm *bprm;
    	int retval;
    	/* […] */
    
    	retval = count(argv, MAX_ARG_STRINGS);
    	if (retval < 0)
    		goto out_free;
    	bprm->argc = retval;
    
    	retval = count(envp, MAX_ARG_STRINGS);
    	if (retval < 0)
    		goto out_free;
    	bprm->envc = retval;
    
    	retval = bprm_stack_limits(bprm);
    	if (retval < 0)
    		goto out_free;
    
    	retval = copy_string_kernel(bprm->filename, bprm);
    	if (retval < 0)
    		goto out_free;
    	bprm->exec = bprm->p;
    
    	retval = copy_strings(bprm->envc, envp, bprm);
    	if (retval < 0)
    		goto out_free;
    
    	retval = copy_strings(bprm->argc, argv, bprm);
    	if (retval < 0)
    		goto out_free;
    
    	retval = bprm_execve(bprm, fd, filename, flags);
    
    	/* […] */
    	return retval;
    }

    The two invocations of the count function calculate the number of command line arguments and environment variables. Each call may fail if the number exceeds MAX_ARG_STRINGS. Technically speaking this is another limit, but in practice the constant is over two billion and, as we’ll see later, there is no way to reach this number without reaching other limits first. The only other situation in which the count function may return an error is a memory fault, but that’s not interesting for our analysis.
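
    The kernel’s count function is not shown in this article. The following is a simplified user-space analogue written for illustration only: it walks a NULL-terminated array and gives up once the number of entries exceeds a maximum. The name count_strings is an assumption, and the memory fault handling mentioned above is left out entirely.

```c
#include <errno.h>
#include <stddef.h>

/* Counts entries of a NULL-terminated array of strings.  Returns the
 * count, or -E2BIG if the array holds more than `max` entries. */
int count_strings(const char *const *strings, int max)
{
    int i = 0;

    if (strings == NULL)
        return 0;

    while (strings[i] != NULL) {
        if (i >= max)
            return -E2BIG;
        ++i;
    }
    return i;
}
```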

    Limit calculation

    bprm_stack_limits is where the actual calculation happens. The function determines the limit and stores it in the bprm structure. It’s defined as follows:

    static int bprm_stack_limits(struct linux_binprm *bprm)
    {
    	unsigned long limit, ptr_size;
    
    	limit = _STK_LIM / 4 * 3;
    	limit = min(limit, bprm->rlim_stack.rlim_cur / 4);
    	limit = max_t(unsigned long, limit, ARG_MAX);
    
    	ptr_size = (bprm->argc + bprm->envc) * sizeof(void *);
    	if (limit <= ptr_size)
    		return -E2BIG;
    	limit -= ptr_size;
    
    	bprm->argmin = bprm->p - limit;
    	return 0;
    }

    _STK_LIM is the default stack size limit and equals 8 MiB. The first expression in the function is what introduces the upper bound of 6 MiB for arguments. It’s worth noting that it’s a relatively new restriction introduced in Linux 4.13 (and later back-ported to previous releases). Why it’s there might be a story for another time.

    The second expression in the function is what implements the ‘quarter of the stack size’ rule. This is what could be called the ‘normal’ case and it is by far the most typical of common desktop and server configurations. With the default maximum stack size limit being 8 MiB, the default limit for executable arguments ends up being 2 MiB.

    The third expression sets the limit to be no less than ARG_MAX. This gets a bit confusing. ARG_MAX is supposed to be a dynamic value and here we see a constant of the same name. As is often the case, the explanation lies in the past. Historically the value was constant and defined as a macro in kernel headers. Eventually, a more dynamic approach was introduced but the definition of the macro stuck. To maintain backwards-compatibility, the dynamic calculation kept the old static value as a lower bound.

    The last adjustment in the function is to reserve space for the argv and envp arrays. If the limit cannot accommodate them, the function returns an error; otherwise the limit is reduced by the necessary space. This is where we can see that the limit of two billion arguments and environment variables (imposed by the count function called in do_execveat_common) can never be reached. With a 6 MiB upper bound for the limit, and each string costing at least one pointer plus a terminating NUL byte, the most one could hope for is about 1.25 million arguments, and that’s only on a 32-bit system with all strings empty.
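
    The arithmetic can be checked with a user-space model of bprm_stack_limits. This is a sketch under assumptions rather than kernel code: the function names are made up, the pointer size is passed in as a parameter so both 32-bit and 64-bit cases can be tried, and the space taken by the executable path (which is copied first) is ignored.

```c
/* Models bprm_stack_limits().  Returns the number of bytes left for
 * strings, or -1 where the kernel function would return -E2BIG. */
long long string_space(unsigned long long stack_cur,
                       unsigned long long argc, unsigned long long envc,
                       unsigned long long ptr_size)
{
    const unsigned long long stk_lim = 8ULL << 20;   /* _STK_LIM, 8 MiB */
    const unsigned long long arg_max = 128ULL << 10; /* ARG_MAX, 128 KiB */
    unsigned long long limit = stk_lim / 4 * 3;      /* 6 MiB upper bound */
    unsigned long long ptrs;

    if (stack_cur / 4 < limit)
        limit = stack_cur / 4;                       /* quarter of the stack */
    if (limit < arg_max)
        limit = arg_max;                             /* 128 KiB lower bound */

    ptrs = (argc + envc) * ptr_size;
    if (limit <= ptrs)
        return -1;
    return (long long)(limit - ptrs);
}

/* Upper estimate of how many empty strings fit under the 6 MiB bound:
 * each one needs a pointer and a single NUL byte. */
unsigned long long max_empty_strings(unsigned long long ptr_size)
{
    const unsigned long long upper = (8ULL << 20) / 4 * 3;

    return upper / (ptr_size + 1);
}
```

    max_empty_strings(4) gives 1258291, which is the 1.25 million figure quoted above, while max_empty_strings(8) gives 699050 for 64-bit pointers. Note that reaching the 6 MiB bound at all requires a stack limit of at least 24 MiB.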

    The calculated limit is finally recorded in the argmin field of the bprm structure. The field specifies the lowest address at which arguments can still be stored, and the value will be checked later on when the program’s executable path, environment variables and command line arguments are copied. Recall that the stack grows downward, which is why the field specifies the minimum and why it’s calculated by subtracting the argument size limit from the current top of the stack (specified by bprm->p).

    Copying strings

    Finally, do_execveat_common checks the lengths of the strings while copying them to the new program’s memory. First, the path to the program’s executable is transferred with the help of the copy_string_kernel function, which is defined as follows:

    int copy_string_kernel(const char *arg, struct linux_binprm *bprm)
    {
    	int len = strnlen(arg, MAX_ARG_STRLEN) + 1 /* terminating NUL */;
    	unsigned long pos = bprm->p;
    
    	if (len == 0)
    		return -EFAULT;
    	if (!valid_arg_len(bprm, len))
    		return -E2BIG;
    
    	arg += len;
    	bprm->p -= len;
    	if (IS_ENABLED(CONFIG_MMU) && bprm->p < bprm->argmin)
    		return -E2BIG;
    
    	/* [… copy the string …] */
    	/* [… analogous to memcpy(bprm->p, arg, len); …] */
    
    	return 0;
    }

    Firstly, strnlen paired with a call to valid_arg_len checks whether the string exceeds MAX_ARG_STRLEN bytes (or 128 KiB). valid_arg_len is a trivial inline function whose body simply states return len <= MAX_ARG_STRLEN;. If the size of the string, including the terminating NUL, exceeds the limit, the argument list is deemed too long and the function returns an error.
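
    The boundary is easy to get wrong by one, so here is a user-space sketch of the same check. The constant MAX_ARG_STRLEN_ASSUMED and the function check_arg_len are names invented for this example; the value of 128 KiB is taken from the text above.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* for strnlen */
#endif
#include <errno.h>
#include <string.h>

/* 128 KiB, matching the MAX_ARG_STRLEN value given in the text. */
#define MAX_ARG_STRLEN_ASSUMED (128 * 1024)

/* Returns the size of the string including its terminating NUL, or
 * -E2BIG if that size exceeds the per-string limit. */
int check_arg_len(const char *arg)
{
    size_t len = strnlen(arg, MAX_ARG_STRLEN_ASSUMED) + 1;

    if (len > MAX_ARG_STRLEN_ASSUMED)
        return -E2BIG;
    return (int)len;
}
```

    Because the NUL byte counts towards the limit, a string of 131071 characters is the longest one accepted; one of 131072 characters is already rejected.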

    Then, the function checks if there’s enough space on the stack to fit the string. This is done by moving the stack pointer downwards (i.e. subtracting len from the bprm->p field) to reserve memory for the argument and checking whether the new position of the edge of the stack crossed the limit (by checking if bprm->p < bprm->argmin). If so, the argument list is too long. Otherwise the argument is copied onto the stack.
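
    The stack accounting can be modelled with two numbers. The structure and function below are hypothetical stand-ins for the p and argmin fields of bprm. Unlike the kernel code, this sketch compares before subtracting so that a very large len cannot wrap the unsigned pointer around; for values that fit, the outcome is the same.

```c
#include <errno.h>

/* Stand-in for the two fields of struct linux_binprm used by the check. */
struct arg_stack {
    unsigned long p;      /* current edge of the stack; moves downward */
    unsigned long argmin; /* lowest address strings may occupy */
};

/* Reserves `len` bytes below the current edge of the stack.  Returns 0
 * on success or -E2BIG if the reservation would cross argmin. */
int reserve_string(struct arg_stack *stack, unsigned long len)
{
    if (len > stack->p - stack->argmin)
        return -E2BIG;
    stack->p -= len;
    return 0;
}
```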

    The copy_strings function, which do_execveat_common calls to transfer environment variables and command line arguments, is entirely analogous. The two differences are that i) the source data lives in user-space and ii) the function operates in a loop, copying a sequence of strings.

    static int copy_strings(int argc, struct user_arg_ptr argv,
    			struct linux_binprm *bprm)
    {
    	/* […] */
    	int ret;
    
    	while (argc-- > 0) {
    		const char __user *str;
    		int len;
    		unsigned long pos;
    
    		ret = -EFAULT;
    		str = get_user_arg_ptr(argv, argc);
    		if (IS_ERR(str))
    			goto out;
    
    		len = strnlen_user(str, MAX_ARG_STRLEN);
    		if (!len)
    			goto out;
    
    		ret = -E2BIG;
    		if (!valid_arg_len(bprm, len))
    			goto out;
    
    		pos = bprm->p;
    		str += len;
    		bprm->p -= len;
    		if (bprm->p < bprm->argmin)
    			goto out;
    
    		/* [… copy the string …] */
    		/* [… analogous to memcpy(bprm->p, str, len); …] */
    	}
    	ret = 0;
    out:
    	/* […] */
    	return ret;
    }

    Having to read from user-space complicates the function, though much of that complexity has been hidden from the listing above in the elided code. The visible parts are calls to get_user_arg_ptr and strnlen_user instead of strnlen.

    The parts that interest us remain the same: the valid_arg_len call and the bprm->p < bprm->argmin comparison.

    Conclusion

    This concludes the investigation. In the previous article we saw how the argument length limit affects user-space; here we looked at the source code of the kernel to confirm our previous findings. There are still a few minor mysteries (such as why the 6 MiB limit exists or what happens if the maximum stack size is less than 128 KiB) which I may tackle at another time.

    It remains important to remember that our findings are true for Linux only. Other kernels will set the limit differently and count different things towards it. POSIX leaves the details purposefully vague. As a result, a portable application may struggle to interpret the limit; it should not only take the value of ARG_MAX with a grain of salt but ideally also recover from an E2BIG error by reducing the number of arguments.

    Fortunately, UNIX-like systems provide a simple solution in the form of xargs and find … -exec … + commands. Those should be much easier to use and sufficient for most cases. They will typically know how to deal with the command’s argument size limit.
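
    For a program which has to invoke a command itself, the same idea can be sketched as splitting the arguments into batches which fit in a byte budget. The batch_size function and its accounting (string, NUL byte and one pointer per argument) are assumptions modelled on the Linux behaviour described in this article, not a portable rule. The budget should be chosen well below ARG_MAX since the environment counts towards the limit as well.

```c
#include <stddef.h>
#include <string.h>

/* Returns how many arguments, starting at args[start], fit in `budget`
 * bytes when each one costs its length, a NUL byte and one pointer. */
size_t batch_size(const char *const *args, size_t count, size_t start,
                  size_t budget)
{
    size_t used = 0, n = 0;

    for (size_t i = start; i < count; ++i) {
        size_t cost = strlen(args[i]) + 1 + sizeof(char *);

        if (cost > budget - used)
            break;
        used += cost;
        ++n;
    }
    return n;
}
```

    A caller would run the command once per batch, advancing start by the returned value each time. A return value of zero with arguments still remaining means a single argument does not fit in the budget and needs to be reported as an error. If executing the command still fails with E2BIG, the caller can retry with a smaller budget.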

    Whatever the case may be, I hope this article has been informative and provided further understanding of the kernel and its interaction with user-space.

    glibc 2.34 has since been released with this fix included.