Alpaquita Linux: Selecting a malloc implementation

1. Default malloc implementation in libc

glibc uses ptmalloc2 (pthreads malloc) general-purpose memory allocator. This is one of the most proven and fastest malloc implementations for multiple threads without lock contention. It has a good speed/memory balance and tunable parameters via GLIBC_TUNABLES environment variable.

musl utilizes the so called mallocng allocator introduced and used by default since version 1.2.1 in 2020. It is similar to OpenBSD’s omalloc. It is known for its strengthened protection against heap-based overflows, use-after-free and double-free errors, and better fragmentation avoidance.

For more information, check the following links:

2. External general-purpose memory allocators

In addition to the built-in allocators, there are a number of other external allocators that are available for both glibc and musl:

mimalloc: outperforms other leading allocators (jemalloc, tcmalloc, etc.), and often uses less memory.
mimalloc-secure: mimalloc built in secure mode that adds guard pages, randomized allocation, and encrypted free lists to protect against various heap vulnerabilities. It has an around 10% performance penalty in various benchmarks compared to mimalloc.
jemalloc: implementation that emphasizes fragmentation avoidance and scalable concurrency support.
rpmalloc: a general purpose allocator with lock-free thread caching.

3. Enable external allocators globally

To switch to one of these allocators, such as to mimalloc, run the following command in Alpaquita Linux:

apk add mimalloc-global

-global packages work with LD_PRELOAD environment variable that is exported via /etc/profile.d/ configuration scripts. Only one -global package can be installed at a time. If the package is installed, the new one replaces the installed package during the installation.

apk add jemalloc-global
...
(1/3) Installing jemalloc (5.3.0-r1)
(2/3) Installing jemalloc-global (5.3.0-r1)
Executing jemalloc-global-5.3.0-r1.post-install
*
* Please logout and login to apply the changes to your environment
* or exec 'source /etc/profile.d/jemalloc.sh'
*
(3/3) Purging mimalloc-global (1.7.7-r1)

Make sure it is used globally as follows:

ldd /bin/busybox
  ldd /bin/busybox
  /lib/ld-musl-x86_64.so.1 (0x7ff28dffc000)
  /lib/libjemalloc.so.2 => /lib/libjemalloc.so.2 (0x7ff28dc00000)
  ...

4. Enable external allocators individually

If you want to use external allocators only for specific packages, and even more, if you want to use different external allocators at the same time, you should not use global packages. Instead, manually provide the LD_PRELOAD environment variable only for specific applications.

Installing several allocators:

apk add jemalloc mimalloc
(1/2) Installing jemalloc (5.3.0-r1)
(2/2) Installing mimalloc (1.8.1-r0)
Executing busybox-1.36.0-r7.trigger
OK: 866 MiB in 224 packages

Using jemalloc only for java and mimalloc for python app:
```
jemalloc.sh java app
mimalloc.sh python -m app
```

If you do not want to use LD_PRELOAD and know exactly that the app should use certain external allocators right out of the box, build your app statically with one of these implementations. Alpaquita provides -dev and -static packages for this case.

5. Debugging glibc malloc

All the debugging features in glibc malloc are moved into a separate library named libc_malloc_debug.so.0 for better security and performance. Perform the following to enable it:

export GLIBC_TUNABLES=glibc.malloc.check=3
LD_PRELOAD=/usr/lib/libc_malloc_debug.so.0 ./app

For more information, see: Securing malloc in glibc.

6. Selecting memory allocators

Memory allocations in Java

The JVM has its own heap that is maintained by the garbage collector. However, it also uses calls to malloc() to implement several Java objects, such as:

Buffers for data compression routines, which are the memory space that the Java Class Libraries require to read or write compressed data like .zip or .jar files.
Malloc allocations by application JNI code.
Compiled code generated by the Just In Time (JIT) Compiler.
Threads to map to Java threads.

When your app is extensively using such objects it is worth to look and check the performance with alternative allocators.

Memory allocation in Python

Default memory allocator in Python is pymalloc allocator that is optimized for small objects (smaller or equal to 512 bytes) with a short lifetime. For larger allocations, it falls back to the system allocator. Under the hood it can use mmap()/munmap() and malloc()/free(). You can disable the pymalloc with PYTHONMALLOC=malloc environment variable. It might be handy to see the various statistics of the pymalloc using the PYTHONMALLOCSTATS=1 variable.

Even if pymalloc is still used by the Python interpreter by default for small allocations, you can use the techniques described above to specify a different malloc implementation for larger allocations or for allocations in parts of your Python application implemented in C.

Choosing better performance

Let’s create a sample Python application. We will not disable pymalloc and only examine the libc malloc allocation there, the number of malloc calls, and the allocated lengths. This practice can show what the application does at runtime, whether it makes sense to change or experiment with another allocator.

First, install the perf tool and python3:
```
apk add perf python3
```

Let’s create an artificial mini-benchmark as shown below that can read the same file line by line (for simplicity, /proc/kallsyms file) in parallel using 10 reader-threads.

Essentially, we are comparing the time it takes to create the Python threads and the time it takes to read an entire file for all reader threads.

import threading
import datetime

OUTER_THREADS_N = 2
INNER_THREADS_N = 5

def read_file():
    with open('/proc/kallsyms') as file:
        for line in file:
            pass

def spawn_threads(n, fn_name, args):
    threads = []
    for i in range(n):
        t = threading.Thread(target=fn_name, args=args)
        t.start()
        threads.append(t)
    for t in threads:
        t.join()

dt0 = datetime.datetime.now()
spawn_threads(OUTER_THREADS_N, spawn_threads, (INNER_THREADS_N, read_file, ()))

diff = datetime.datetime.now() - dt0
print(f"{diff.total_seconds()}")

Add a dynamic tracepoint for our system allocations in musl:

perf probe -x /usr/lib/ld-musl-x86_64.so.1 'default_malloc n:u32'
perf probe -x /usr/lib/ld-musl-x86_64.so.1 'realloc n:u32'

Run memory.py:

perf record -z -e probe_ld:default_malloc -e probe_ld:realloc python memory.py

Once the application finished running, you can analyze its output.

Created tasks overview:

perf report --tasks
#      pid      tid     ppid  comm
         0        0       -1 |swapper
      3565     3565       -1 |python
      3565     3567     3565 | python
      3565     3568     3567 |  python
      3565     3569     3567 |  python
      3565     3572     3567 |  python
      3565     3573     3567 |  python
      3565     3574     3567 |  python
      3565     3570     3565 | python
      3565     3571     3570 |  python
      3565     3575     3570 |  python
      3565     3576     3570 |  python
      3565     3577     3570 |  python
      3565     3578     3570 |  python

Various statisitcs:

perf report --stats
Aggregated stats:
           TOTAL events:      39845
...
probe_ld:default_malloc stats:
          SAMPLE events:      26821
probe_ld:realloc stats:
          SAMPLE events:      12832

You can see the overhead and the hot malloc lengths:

perf report --stdio -n
# Samples: 26K of event 'probe_ld:default_malloc'
# Event count (approx.): 26929
#
# Overhead       Samples  Trace output
# ........  ............  ...........................
#
    47.05%         12670  (7fddb81be420) n_u32=8225
     1.67%           450  (7fddb81be420) n_u32=4140
     1.41%           380  (7fddb81be420) n_u32=4127
     1.37%           370  (7fddb81be420) n_u32=4116
     ...

In our example, the length is 8225 bytes. The allocation comes from reading the file. Have a look at the allocations for one of our reader threads (remember, we had 10 such threads):

perf report -F pid,overhead,sample,trace --stdio -n
     ...
     3578:python      4.72%          1266  (7fb52688a420) n_u32=8225
     3578:python      0.15%            40  (7fb52688a420) n_u32=4140
     3578:python      0.14%            37  (7fb52688a420) n_u32=4116
     ...

From the above example we can see that our simple application calls malloc() approximately 26 thousand times and realloc() 13 thousand times during runtime, and not just at startup with pymalloc enabled. Experimenting with different allocators is a good idea here. Continue trying all of them as follows:

for i in dummy rpmallocwrap.so jemalloc.so.2 mimalloc.so.1 mimalloc-secure.so.1; do
    echo "$i:"
    LD_PRELOAD=/lib/lib$i python memory.py
done

The table below shows that replacing musl’s malloc with even mimalloc-secure improves performance for this workload by ~30%.

	`musl`	`glibc`	`mimalloc`	`mi-secure`	`jemalloc`	`rpmalloc`
musl	0.685	-	0.451	0.472	0.438	0.44
glibc	-	0.439	0.473	0.485	0.447	0.449

Choosing better security

If security is more important than performance, it is better to stick with the default malloc implementation, or use external mimalloc-secure. musl already has some security mitigations that prevent exploiting bugs in the calling application.

musl security features:

Heap metadata protected by the guard pages.
Detect and trap any attempt to free a slot that is already free, or an address that is not part of an allocation obtained by malloc.
Write-after-free detection in the next call in case the metadata becomes inconsistent.
Detection and trapping of single-byte overflows with arbitrary non-zero values on realloc/free time.

mimalloc-secure features:

All internal mimalloc pages are surrounded by guard pages, and the heap metadata is also behind a guard page, therefore, a buffer overflow exploit cannot reach the metadata.
All free list pointers are encoded with per-page keys which are used both to prevent overwriting with a known pointer, and to detect heap corruption.
Double free’s are detected (and ignored).
The free lists are initialized in a random order, and allocation randomly chooses between an extension and reuse within a page to mitigate against attacks that rely on a predicable allocation order. Similarly, the larger heap blocks allocated by mimalloc from the OS have randomized addresses.