The Intel® Threading Building Blocks (Intel® TBB) library provides an alternative way to allocate dynamic memory: the Intel TBB scalable allocator (tbbmalloc). Its purpose is to provide better performance and scalability for memory allocation and deallocation in multithreaded applications than the default allocator. This improvement comes at a price, however: increased memory consumption.
If memory footprint becomes a problem, developers now have a way to reduce it: the new “soft limit” feature, available since Intel TBB 4.2 Update 1. The developer sets a limit in bytes by calling scalable_allocation_mode( TBBMALLOC_SET_SOFT_HEAP_LIMIT, <size> ). When the limit is reached, memory is released from tbbmalloc's internal buffers, decreasing memory consumption. This does not prevent further allocations; it only clears the buffers.
NOTE! This is a soft limit, which means there is no guarantee that memory consumption will stay below it. Setting the limit tells tbbmalloc to change its strategy: it buffers less memory and thereby becomes less “greedy,” but it may also become somewhat slower.
Test example
The following example demonstrates the effect of the soft limit. You can build the sample by simply putting the code pieces together; the complete source is also attached for your convenience. The example implements three tests with different memory allocation patterns: random-size objects, constant-size objects, and growing-size objects. Each of the three tests has two modes: using tbbmalloc or using standard malloc.
The first test allocates 1000 objects of a random size (a new size is picked for each repeat) and then frees them. This is repeated 100 times:
```cpp
#include <iostream>
#include <cstdlib>
#include <cstring>   // for memset
#include <stdio.h>
#include <tbb/scalable_allocator.h>
#include <tbb/tick_count.h>

const size_t KB = 1024;
const size_t MB = 1024*KB;

void test1( bool use_default_malloc ) {
    const int REPEATS = 100;
    const int NUM_ALLOCS = 1000;
    const size_t MIN_SIZE = 8;
    const size_t MAX_SIZE = 100*KB;
    srand(1234);
    void *ptrs[NUM_ALLOCS];
    for ( int i=0; i<REPEATS; ++i ) {
        size_t size = (MAX_SIZE-MIN_SIZE) * rand() / RAND_MAX + MIN_SIZE;
        for ( int j=0; j<NUM_ALLOCS; ++j ) {
            ptrs[j] = use_default_malloc ? malloc( size ) : scalable_malloc( size );
            memset( ptrs[j], 0, size );
        }
        for ( int j=0; j<NUM_ALLOCS; ++j )
            use_default_malloc ? free( ptrs[j] ) : scalable_free( ptrs[j] );
    }
}
```
The second test allocates 1000000 objects of the same size – 100 bytes:
```cpp
#include <vector>

void test2( bool use_default_malloc ) {
    const size_t SIZE = 100;
    const int NUM_ALLOCS = 1000000;
    // Keep the pointer table on the heap: an array of 10^6 pointers
    // (8 MB on 64-bit) could overflow the default thread stack.
    std::vector<void*> ptrs( NUM_ALLOCS );
    for ( int i=0; i<100; ++i ) {
        for ( int j=0; j<NUM_ALLOCS; ++j )
            ptrs[j] = use_default_malloc ? malloc( SIZE ) : scalable_malloc( SIZE );
        for ( int j=0; j<NUM_ALLOCS; ++j )
            use_default_malloc ? free( ptrs[j] ) : scalable_free( ptrs[j] );
    }
}
```
The third test allocates a sequence of objects, each growing in size. This is done to prevent reusing internal buffers created by Intel TBB scalable allocator:
```cpp
void test3( bool use_default_malloc ) {
    const size_t SIZE_STEP = 8*KB;
    const size_t MAX_SIZE = 32*MB;
    const int NUM_ALLOCS = 10;
    void *ptrs[NUM_ALLOCS];
    for ( size_t size=SIZE_STEP; size<MAX_SIZE; size+=SIZE_STEP ) {
        for ( int j=0; j<NUM_ALLOCS; ++j )
            ptrs[j] = use_default_malloc ? malloc( size ) : scalable_malloc( size );
        for ( int j=0; j<NUM_ALLOCS; ++j )
            use_default_malloc ? free( ptrs[j] ) : scalable_free( ptrs[j] );
    }
}
```
The tests are dispatched through a table of test_func pointers:
```cpp
typedef void (*test_func)( bool );
test_func run_test[] = { test1, test2, test3 };
void report_memory_usage();
```
The main() function runs the examples. The user specifies the test number, the soft limit in megabytes, and which allocation mode to use: default malloc or tbbmalloc:
```cpp
int main( int argc, char* argv[] ) {
    if ( argc < 3 ) {
        std::cout << "Usage: soft_limit_test.exe <test_num> <soft_limit_mb> [use_default_malloc]" << std::endl;
        return EXIT_FAILURE;
    }
    int test_num = atoi( argv[1] );
    size_t soft_limit = atoi( argv[2] ) * MB;
    bool use_default_malloc = argc > 3 && atoi( argv[3] ) == 1;
    const int NUM_TESTS = sizeof(run_test)/sizeof(test_func);
    // Reject out-of-range test numbers (including 0 and negatives).
    if ( test_num < 1 || test_num > NUM_TESTS ) {
        std::cout << "no test #" << test_num << std::endl;
        return EXIT_FAILURE;
    }
```
Set the tbbmalloc soft limit as follows:
```cpp
    if ( soft_limit ) {
        if ( scalable_allocation_mode( TBBMALLOC_SET_SOFT_HEAP_LIMIT, soft_limit ) != TBBMALLOC_OK ) {
            std::cout << "Soft Limit enabling has failed!" << std::endl;
            return EXIT_FAILURE;
        }
        std::cout << "Soft Limit enabled: " << soft_limit / MB << " MB." << std::endl;
    }

    std::cout << "Test #" << test_num << ":";
    tbb::tick_count t0 = tbb::tick_count::now();
```
The tests are run in parallel using OpenMP*. The choice of threading model doesn't really matter; OpenMP is used here simply to highlight that tbbmalloc can be used outside of TBB-threaded code.
```cpp
    #pragma omp parallel
    run_test[test_num-1]( use_default_malloc );

    tbb::tick_count t1 = tbb::tick_count::now();
    std::cout << " done." << std::endl;
    report_memory_usage();
    std::cout << "Elapsed time: " << (t1-t0).seconds() << " seconds." << std::endl;
    return EXIT_SUCCESS;
}
```
The code to measure memory consumption:
```cpp
#if _WIN32
#include <Windows.h>
#include <Psapi.h>

void report_memory_usage() {
    PROCESS_MEMORY_COUNTERS pmc;
    if ( HANDLE h = OpenProcess( PROCESS_QUERY_INFORMATION | PROCESS_VM_READ,
                                 FALSE, GetCurrentProcessId() ) ) {
        if ( GetProcessMemoryInfo( h, &pmc, sizeof(pmc) ) )
            std::cout << "Peak Memory Consumption: " << pmc.PeakPagefileUsage/MB
                      << " MB." << std::endl;
        CloseHandle( h );
    }
}
#else
#include <cstdio>
#include <cstring>

void report_memory_usage() {
    if ( FILE *stat_f = fopen( "/proc/self/status", "r" ) ) {
        size_t sz = 0;
        char buf[200];
        // Scan the status file for the VmPeak line (peak virtual memory, in kB).
        while ( fgets( buf, 200, stat_f ) && !sz )
            if ( 0 == strncmp( buf, "VmPeak:", strlen("VmPeak:") ) )
                sscanf( buf, "VmPeak: %lu kB", &sz );
        if ( sz )
            std::cout << "Peak Memory Consumption: " << sz/1024 << " MB." << std::endl;
        fclose( stat_f );
    }
}
#endif
```
Test results
The tests were performed on two different machines; see the configuration details below.
Each test was performed without the soft limit, then with soft limits of 500 MB, 400 MB, and 300 MB, and finally with the default allocator. The number of threads performing memory allocations corresponds to the number of logical CPUs on the test machine.
NOTE! The results differ from run to run, but the general trend in memory consumption is the same on the test configurations described below. In another test environment the tests may perform differently. These results are provided for demonstration only; they are not a general rule for performance and memory consumption. Memory consumption and elapsed time can be compared only within a single test (a column in the table), not across different tests.
System configuration #1
CPU: Intel® Core™ i5-2540M processor, 2 Cores 4 threads
RAM: 4GB
OS: Microsoft* Windows* 7 x64, Version 11.0.60610.01 Update 3
Compiler: Intel® C++ Composer XE 2013 SP1 update 1
Intel(R) TBB version: Intel(R) TBB 4.2 update 3
Application configuration: 64-bit
Compiler command line: /GS /Qopenmp /W3 /Gy /Zc:wchar_t /Zi /O2 /Fd"x64\Release\vc110.pdb" /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /Qstd=c++11 /Qipo /Zc:forScope /Oi /MD /Fa"x64\Release\" /EHsc /nologo /Fo"x64\Release\" /Fp"x64\Release\soft_limit_test.pch"
Test results for system #1
| | Test #1: Peak Memory, MB | Test #1: Elapsed time, s | Test #2: Peak Memory, MB | Test #2: Elapsed time, s | Test #3: Peak Memory, MB | Test #3: Elapsed time, s |
|---|---|---|---|---|---|---|
| tbbmalloc, no limit | 607 | 5.58 | 483 | 10.50 | 1716 | 0.46 |
| tbbmalloc, 500 MB limit | 505 | 5.59 | 459 | 10.69 | 1131 | 0.86 |
| tbbmalloc, 400 MB limit | 404 | 5.39 | 481 | 10.51 | 1118 | 1.02 |
| tbbmalloc, 300 MB limit | 405 | 5.40 | 487 | 10.51 | 1152 | 1.09 |
| Default malloc | 389 | 10.76 | 439 | 15.04 | 1090 | 1.15 |
System configuration #2
CPU: Intel(R) Xeon(R) CPU E31275 3.40GHz, 4 cores 8 threads
RAM: 16 GB
OS: Microsoft* Windows* 8.1 Enterprise x64
Compiler: Intel® C++ Composer XE 2013 SP1 Update 2 (Compiler 14.0.1290.11)
Intel(R) TBB version: Intel(R) TBB 4.2 update 3
Application configuration: 64-bit
Compiler command line: /GS /Qopenmp /W3 /Gy /Zc:wchar_t /Zi /O2 /Fd"x64\Release\vc110.pdb" /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /Qipo /Zc:forScope /Oi /MD /Fa"x64\Release\" /EHsc /nologo /Fo"x64\Release\" /Fp"x64\Release\soft_limit_test.pch"
Test results for system #2
| | Test #1: Peak Memory, MB | Test #1: Elapsed time, s | Test #2: Peak Memory, MB | Test #2: Elapsed time, s | Test #3: Peak Memory, MB | Test #3: Elapsed time, s |
|---|---|---|---|---|---|---|
| tbbmalloc, no limit | 774 | 5.47 | 954 | 9.90 | 2675 | 1.00 |
| tbbmalloc, 500 MB limit | 754 | 5.43 | 957 | 9.98 | 2092 | 3.09 |
| tbbmalloc, 400 MB limit | 675 | 5.22 | 951 | 9.83 | 2046 | 3.17 |
| tbbmalloc, 300 MB limit | 686 | 5.47 | 949 | 9.81 | 2123 | 3.22 |
| Default malloc | 803 | 17.21 | 950 | 16.73 | 2103 | 17.45 |
In test #1 (random-size allocations) the soft limit has a visible effect on memory consumption, while performance is not affected. The default allocator has a smaller memory footprint on system #1, but runs significantly slower than tbbmalloc on both systems.
In test #2 (many allocations of the same size) the soft limit has no effect on memory consumption: constant-size blocks allow internal buffers to be reused, so tbbmalloc already keeps its footprint tight and there is nothing left to reclaim. The default allocator consumes slightly less memory on system #1 and about the same as tbbmalloc on system #2, but its execution time is higher.
Test #3 (growing block size) shows a clear effect from the soft limit: memory consumption decreases. This allocation pattern forces tbbmalloc to fill its internal buffers, so freeing them yields a significant reduction in footprint. On the other hand, execution time increases significantly as well; minimizing the memory footprint comes at a cost in this case. In terms of memory consumption the default allocator is on par with tbbmalloc with the soft limit enabled; its execution time is much worse on system #2 and close to tbbmalloc on system #1.
Summary
The Intel TBB allocator's soft limit feature may help you decrease memory consumption, but its effect depends heavily on the application, the platform, and the memory allocation patterns. In the provided example you can see an effect with random-size and growing-size memory blocks, but none with constant-size blocks. This is not a universal rule, however: when you consider using tbbmalloc with a soft limit in your application, always test with your own workloads to check the performance and memory footprint implications.
You can find the most recent Intel TBB version on Intel’s commercial and open source sites.
*Other brands and names are the property of their respective owners.