The Intel® Threading Building Blocks (Intel® TBB) library provides an alternative way to allocate dynamic memory: the Intel TBB scalable allocator (tbbmalloc). Its purpose is to provide better performance and scalability for memory allocation and deallocation in multithreaded applications than the default allocator. This improvement comes at a price, however: increased memory consumption.
If memory footprint becomes a problem, developers now have a way to reduce it: the new “soft limit” feature, available since Intel TBB 4.2 Update 1. The developer sets a limit in bytes by calling scalable_allocation_mode( TBBMALLOC_SET_SOFT_HEAP_LIMIT, <size> ). When the limit is reached, memory is released from tbbmalloc's internal buffers, decreasing memory consumption. This does not prevent further allocations; it only clears the buffers.
NOTE! This is a soft limit, which means there is no guarantee that memory consumption will stay below it. Setting the limit tells tbbmalloc to change its strategy: it buffers less memory and thereby becomes less “greedy,” but it may also become somewhat slower.
Test example
The following example demonstrates the effect of the soft limit. You can build the sample by simply putting the code pieces together; the complete source is also attached for your convenience. The example implements three tests with different memory allocation patterns: random-size objects, constant-size objects, and growing-size objects. Each of the three tests has two modes: using tbbmalloc or using standard malloc.
The first test allocates 1000 objects of a random size (a new size is picked for each repeat) and then frees them. This is repeated 100 times:
```cpp
#include <iostream>
#include <cstdlib>
#include <cstring>   // for memset
#include <stdio.h>
#include <tbb/scalable_allocator.h>
#include <tbb/tick_count.h>

const size_t KB = 1024;
const size_t MB = 1024*KB;

void test1( bool use_default_malloc ) {
    const int REPEATS = 100;
    const int NUM_ALLOCS = 1000;
    const size_t MIN_SIZE = 8;
    const size_t MAX_SIZE = 100*KB;
    srand(1234);
    void *ptrs[NUM_ALLOCS];
    for ( int i=0; i<REPEATS; ++i ) {
        size_t size = (MAX_SIZE-MIN_SIZE) * rand() / RAND_MAX + MIN_SIZE;
        for ( int j=0; j<NUM_ALLOCS; ++j ) {
            ptrs[j] = use_default_malloc ? malloc( size ) : scalable_malloc( size );
            memset( ptrs[j], 0, size );
        }
        for ( int j=0; j<NUM_ALLOCS; ++j )
            use_default_malloc ? free( ptrs[j] ) : scalable_free( ptrs[j] );
    }
}
```
The second test allocates 1000000 objects of the same size – 100 bytes:
```cpp
#include <vector>

void test2( bool use_default_malloc ) {
    const size_t SIZE = 100;
    const int NUM_ALLOCS = 1000000;
    // Keep the pointer table on the heap: an array of 10^6 pointers
    // (8 MB on 64-bit) could overflow the default thread stack.
    std::vector<void*> ptrs( NUM_ALLOCS );
    for ( int i=0; i<100; ++i ) {
        for ( int j=0; j<NUM_ALLOCS; ++j )
            ptrs[j] = use_default_malloc ? malloc( SIZE ) : scalable_malloc( SIZE );
        for ( int j=0; j<NUM_ALLOCS; ++j )
            use_default_malloc ? free( ptrs[j] ) : scalable_free( ptrs[j] );
    }
}
```
The third test allocates a sequence of objects, each growing in size. This is done to prevent reusing internal buffers created by Intel TBB scalable allocator:
```cpp
void test3( bool use_default_malloc ) {
    const size_t SIZE_STEP = 8*KB;
    const size_t MAX_SIZE = 32*MB;
    const int NUM_ALLOCS = 10;
    void *ptrs[NUM_ALLOCS];
    for ( size_t size=SIZE_STEP; size<MAX_SIZE; size+=SIZE_STEP ) {
        for ( int j=0; j<NUM_ALLOCS; ++j )
            ptrs[j] = use_default_malloc ? malloc( size ) : scalable_malloc( size );
        for ( int j=0; j<NUM_ALLOCS; ++j )
            use_default_malloc ? free( ptrs[j] ) : scalable_free( ptrs[j] );
    }
}
```
The tests are dispatched through a table of test_func pointers:
```cpp
typedef void (*test_func)( bool );
test_func run_test[] = { test1, test2, test3 };
void report_memory_usage();
```
The main() function runs the examples. The user specifies the test number, the soft limit in megabytes, and which allocation mode to use: default malloc or tbbmalloc:
```cpp
int main( int argc, char* argv[] ) {
    if ( argc < 3 ) {
        std::cout << "Usage: soft_limit_test.exe <test_num> <soft_limit_mb> [use_default_malloc]" << std::endl;
        return EXIT_FAILURE;
    }
    int test_num = atoi( argv[1] );
    size_t soft_limit = atoi( argv[2] ) * MB;
    bool use_default_malloc = argc > 3 && atoi( argv[3] ) == 1;
    const int NUM_TESTS = sizeof(run_test)/sizeof(test_func);
    // Reject out-of-range test numbers (including 0 and negatives).
    if ( test_num < 1 || test_num > NUM_TESTS ) {
        std::cout << "no test #" << test_num << std::endl;
        return EXIT_FAILURE;
    }
```
Set the tbbmalloc soft limit as follows:
```cpp
    if ( soft_limit ) {
        if ( scalable_allocation_mode( TBBMALLOC_SET_SOFT_HEAP_LIMIT, soft_limit ) != TBBMALLOC_OK ) {
            std::cout << "Soft Limit enabling has failed!" << std::endl;
            return EXIT_FAILURE;
        }
        std::cout << "Soft Limit enabled: " << soft_limit / MB << " MB." << std::endl;
    }

    std::cout << "Test #" << test_num << ":";
    tbb::tick_count t0 = tbb::tick_count::now();
```
The tests are run in parallel using OpenMP*. The choice of threading model doesn't really matter; OpenMP is used here simply to highlight that tbbmalloc can be used outside of TBB-threaded code.
```cpp
    #pragma omp parallel
    run_test[test_num-1]( use_default_malloc );

    tbb::tick_count t1 = tbb::tick_count::now();
    std::cout << " done." << std::endl;
    report_memory_usage();
    std::cout << "Elapsed time: " << (t1-t0).seconds() << " seconds." << std::endl;
    return EXIT_SUCCESS;
}
```
The code to measure memory consumption:
```cpp
#if _WIN32
#include <Windows.h>
#include <Psapi.h>

void report_memory_usage() {
    PROCESS_MEMORY_COUNTERS pmc;
    if ( HANDLE h = OpenProcess( PROCESS_QUERY_INFORMATION | PROCESS_VM_READ,
                                 FALSE, GetCurrentProcessId() ) ) {
        if ( GetProcessMemoryInfo( h, &pmc, sizeof(pmc) ) )
            std::cout << "Peak Memory Consumption: " << pmc.PeakPagefileUsage/MB
                      << " MB." << std::endl;
        CloseHandle( h );
    }
}
#else
#include <cstdio>
#include <cstring>

void report_memory_usage() {
    if ( FILE *stat_f = fopen( "/proc/self/status", "r" ) ) {
        size_t sz = 0;
        char buf[200];
        // Scan the status file for the VmPeak line (peak virtual memory, in kB).
        while ( fgets( buf, 200, stat_f ) && !sz )
            if ( 0 == strncmp( buf, "VmPeak:", strlen("VmPeak:") ) )
                sscanf( buf, "VmPeak: %lu kB", &sz );
        if ( sz )
            std::cout << "Peak Memory Consumption: " << sz/1024 << " MB." << std::endl;
        fclose( stat_f );
    }
}
#endif
```
Test results
The tests were performed on two different machines; see the configuration details below.
Each test was performed without the soft limit, then with soft limits of 500 MB, 400 MB, and 300 MB, and finally with the default allocator. The number of threads performing memory allocations corresponds to the number of logical CPUs on the test machine.
NOTE! The results differ from run to run, but the general trend in memory consumption is the same on the test configurations described below. In another test environment the tests may perform differently. These results are provided for demonstration only; they are not a general rule for performance and memory consumption. Memory consumption and elapsed time can be compared only within a single test (a column in the table), not across different tests.
System configuration #1
CPU: Intel® Core™ i5-2540M processor, 2 Cores 4 threads
RAM: 4GB
OS: Microsoft* Windows* 7 x64, Version 11.0.60610.01 Update 3
Compiler: Intel® C++ Composer XE 2013 SP1 update 1
Intel(R) TBB version: Intel(R) TBB 4.2 update 3
Application configuration: 64-bit
Compiler command line: /GS /Qopenmp /W3 /Gy /Zc:wchar_t /Zi /O2 /Fd"x64\Release\vc110.pdb" /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /Qstd=c++11 /Qipo /Zc:forScope /Oi /MD /Fa"x64\Release\" /EHsc /nologo /Fo"x64\Release\" /Fp"x64\Release\soft_limit_test.pch"
Test results for system #1
| | Test #1: Peak Memory, MB | Test #1: Elapsed time, s | Test #2: Peak Memory, MB | Test #2: Elapsed time, s | Test #3: Peak Memory, MB | Test #3: Elapsed time, s |
|---|---|---|---|---|---|---|
| tbbmalloc, no limit | 607 | 5.58 | 483 | 10.50 | 1716 | 0.46 |
| tbbmalloc, 500 MB limit | 505 | 5.59 | 459 | 10.69 | 1131 | 0.86 |
| tbbmalloc, 400 MB limit | 404 | 5.39 | 481 | 10.51 | 1118 | 1.02 |
| tbbmalloc, 300 MB limit | 405 | 5.40 | 487 | 10.51 | 1152 | 1.09 |
| Default malloc | 389 | 10.76 | 439 | 15.04 | 1090 | 1.15 |
System configuration #2
CPU: Intel(R) Xeon(R) CPU E31275 3.40GHz, 4 cores 8 threads
RAM: 16 GB
OS: Microsoft* Windows* 8.1 Enterprise x64
Compiler: Intel® C++ Composer XE 2013 SP1 Update 2 (Compiler 14.0.1290.11)
Intel(R) TBB version: Intel(R) TBB 4.2 update 3
Application configuration: 64-bit
Compiler command line: /GS /Qopenmp /W3 /Gy /Zc:wchar_t /Zi /O2 /Fd"x64\Release\vc110.pdb" /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /Qipo /Zc:forScope /Oi /MD /Fa"x64\Release\" /EHsc /nologo /Fo"x64\Release\" /Fp"x64\Release\soft_limit_test.pch"
Test results for system #2
| | Test #1: Peak Memory, MB | Test #1: Elapsed time, s | Test #2: Peak Memory, MB | Test #2: Elapsed time, s | Test #3: Peak Memory, MB | Test #3: Elapsed time, s |
|---|---|---|---|---|---|---|
| tbbmalloc, no limit | 774 | 5.47 | 954 | 9.90 | 2675 | 1.00 |
| tbbmalloc, 500 MB limit | 754 | 5.43 | 957 | 9.98 | 2092 | 3.09 |
| tbbmalloc, 400 MB limit | 675 | 5.22 | 951 | 9.83 | 2046 | 3.17 |
| tbbmalloc, 300 MB limit | 686 | 5.47 | 949 | 9.81 | 2123 | 3.22 |
| Default malloc | 803 | 17.21 | 950 | 16.73 | 2103 | 17.45 |
In test #1 (random-size allocations) the soft limit has a visible effect on memory consumption, while performance is not affected. The default allocator has a smaller memory footprint on system #1, but runs significantly slower than tbbmalloc on both systems.
In test #2 (many allocations of the same size) the soft limit has no effect on memory consumption: constant-size blocks allow internal buffers to be reused, so tbbmalloc already keeps its footprint tight and there is nothing left to reclaim. The default allocator consumes slightly less memory on system #1 and about the same as tbbmalloc on system #2, but its execution time is higher.
Test #3 (growing block size) shows a clear effect from the soft limit: memory consumption decreases. This allocation pattern forces tbbmalloc to fill its internal buffers, so freeing them yields a significant reduction in footprint. On the other hand, execution time increases significantly as well; minimizing the memory footprint comes at a cost in this case. In terms of memory consumption the default allocator is on par with tbbmalloc with the soft limit enabled; its execution time is much worse on system #2 and close to tbbmalloc on system #1.
Summary
The Intel TBB allocator's soft limit feature may help you decrease memory consumption, but its effect depends heavily on the application, the platform, and the memory allocation patterns. In the provided example you can see an effect with random-size and growing-size memory blocks, but none with constant-size blocks. This is not a universal rule, however: when you consider using tbbmalloc with a soft limit in your application, always test with your own workloads to check the performance and memory footprint implications.
You can find the most recent Intel TBB version on Intel’s commercial and open source sites.
*Other brands and names are the property of their respective owners.