Thursday, January 26, 2017

C++ tips, 2017 Week 3 (16-Jan - 22-Jan-2017)

This is part of my weekly C++ posts based on the daily C++ tips I make at my work. I strongly recommend this practice - if you don't have it in your company, start it.
A list of all weekly posts can be found here.

1. std::is_trivially_copyable

std::is_trivially_copyable tests whether a type (or a C-style array) is a TriviallyCopyable type. In short, it tells you whether you can, and are allowed to, just memcpy one object into another. "Can" (obviously you can memcpy anything, but...) means the type has no virtual member functions, no virtual base classes, and no non-trivial (user-provided) copy/move operations or destructor. Providing any user-defined copy/move operation or destructor signals to the compiler that there is something special about this type and that it should avoid certain optimizations. "Allowed to" means that at least one copy/move operation is not deleted and the destructor is not deleted either.

We can guard our types with a static_assert to make sure they stay trivially copyable. For example:
template <class T>
struct point {
    T x;
    T y;
};

struct MemcpyMe {
    double a = 0;
    long b[3] = { 0, 1, 2 };
    std::array<point<long>, 2> pts;
};

static_assert(std::is_trivially_copyable_v<MemcpyMe> == true, "MemcpyMe must stay trivially copyable!");
So if someone tries to modify something - for example, adds a destructor to the point struct - the code won't compile.

Another benefit is that the STL is aware of trivial types and enables memcpy/memmove optimizations where it decides they are suitable. Before relying on that, though, check what your STL implementation actually does and under what circumstances.

2. von Neumann bottleneck

The von Neumann bottleneck is the throughput limitation caused by the insufficient rate of data transfer between memory and the CPU. As hardware progressed, CPU speed per money invested increased much faster than memory speed, so hardware vendors introduced buffers between RAM and the CPU - the L1, L2, etc. caches. The closer a cache is to the CPU, the faster, smaller, and more expensive it is: L1, the closest, is usually tens of KBs; L2 is hundreds of KBs; L3 is several MBs. In theory you could build a CPU with MBs of L1 cache, but it would be ridiculously expensive, and most of that cache would likely go to waste unless it was designed for one very specific task running all the time.

Because of the bottleneck, the CPU ends up waiting for data to come from RAM through all the cache levels and back. Hardware vendors apply various strategies to hide this latency. One example is branch prediction: if the CPU has to evaluate a condition to decide which branch of an if to take, and it has to wait for data in order to do the evaluation, the branch predictor guesses which branch is more probable and the CPU starts executing it speculatively. If the guess turns out to be correct when the data arrives - win! Otherwise the CPU rolls back and starts executing the other branch.

However, those are general-purpose heuristics and one cannot depend on them too much, so software developers have to do their part too - data locality, aligned structs, etc.

3. std::filesystem

From C++17 we'll have the std::filesystem library as part of the STL. It provides a portable way to work with the file system and its components - paths, files, etc. It is based on boost::filesystem, which has been around for quite a long time.

A simple example of using recursive_directory_iterator:
using namespace std::filesystem;
for (auto& p : recursive_directory_iterator(current_path()))
    std::cout << p << '\n';
This will print all the files in the current folder and its subfolders. recursive_directory_iterator has non-member begin and end functions and an operator++ - that's why the range-based for loop works.

We can easily define a range with it, and in the future we'll be able to use it with the Ranges library.

4. PLF Colony

plf::colony is a container written by Matthew Bentley, optimized for rapid insertion and deletion of elements, which does not invalidate pointers to the non-erased elements. I heard about it on CppCast.

The motivation for developing it comes from game engine development, and I found it interesting to read about the use cases and the problems they hit while developing game engines.

There is an excellent talk by him from CppCon 2016.

5. Amortized analysis

Amortized analysis is a method for analyzing an algorithm's running time. It provides an upper bound on the cost of an operation evaluated over a sequence of operations. It differs from a worst-case-per-operation upper bound - where the cost of the sequence is the sum of the worst case of every operation - by looking at the sequence as a whole, thus allowing one operation to have a huge cost as long as the surrounding operations are less costly.

A typical example is std::vector. If the reserved space is full before a push_back, the vector reallocates with double capacity and moves the elements to the new memory location - an O(n) operation - but the next n elements are then inserted at the back with O(1) complexity each. Thus pushing back into a std::vector has O(1) amortized complexity.

Here is a nice paper by Rebecca Fiebrink about amortized analysis.