a possible low-level optimization

2024-08-22 18:30:25, 218 views

http://www.1point3acres.com/bbs/thread-212960-1-1.html

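Round two was with a white guy. He opened with a question I still don't understand: something like, given a vector<uint8_t> nums and a 256-element vector<int> counts, iterate over nums doing counts[nums[i]]++, then he asked how to optimize it, hinting that it involves things like the CPU cache (I had no idea). Seeing that I was lost, he followed up with a 3sum problem, which I solved quickly.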

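    #include <stddef.h>
    #include <stdint.h>

    uint8_t input[102400];
    uint32_t count[256];

    void count_it(void)
    {
        /* size_t avoids a signed/unsigned comparison against sizeof */
        for (size_t i = 0; i < sizeof(input) / sizeof(input[0]); i++) {
            ++count[input[i]];
        }
    }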

how to optimize? possible points to consider:

a) the target "count" array is 4 B * 256 = 1 KB, which fits easily in an L1 data cache (typically tens of KB), so no need to worry about that;

b) input array access is sequential, which is actually cache friendly;

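c) updates to "count" could in principle suffer false sharing, but that only arises when several cores write to the same cache line; in this single-threaded loop the whole table stays in one core's L1 cache, so that's fine;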

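d) optimization 1: the loop could be unrolled to reduce the per-iteration loop-condition check;

e) optimization 2: the input array could be pre-fetched (i.e. insert PREFETCH instructions ahead of the loads); since the access is sequential, the hardware prefetcher may already cover this, so the gain should be measured rather than assumed;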

    for (size_t i = 0; i < sizeof(input) / sizeof(input[0]);) {
        // a typical cache line is 64 bytes, so prefetch the next line
        // (this loop assumes the array length is a multiple of 64)
        __builtin_prefetch(&input[i + 64], 0, 3); // prefetch for read, high locality
        for (int j = 0; j < 8; j++) {
            size_t k = i + j * 8;
            ++count[input[k]];
            ++count[input[k + 1]];
            ++count[input[k + 2]];
            ++count[input[k + 3]];
            ++count[input[k + 4]];
            ++count[input[k + 5]];
            ++count[input[k + 6]];
            ++count[input[k + 7]];
        }
        i += 64;
    }

(see https://gcc.gnu.org/onlinedocs/gcc-5.4.0/gcc/Other-Builtins.html for __builtin_prefetch)
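
A variation on the unrolling in d), not from the original post: when the same byte value repeats back to back, each increment has to wait for the previous store to the same counter. One common way around that is to keep several private count tables, so consecutive increments land in different arrays, and sum them at the end. A minimal sketch, assuming four tables (4 KB in total, which should still fit in L1); the names `count_it_split`, `NUM_TABLES` and `partial` are made up for illustration:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_TABLES 4  /* assumption: 4 tables * 1 KB still fit in L1 */

/* Hypothetical helper: adds the histogram of `n` bytes into count[256]. */
void count_it_split(const uint8_t *input, size_t n, uint32_t count[256])
{
    uint32_t partial[NUM_TABLES][256];
    memset(partial, 0, sizeof(partial));

    size_t i = 0;
    /* Main loop: four increments per iteration, each into its own table. */
    for (; i + NUM_TABLES <= n; i += NUM_TABLES) {
        ++partial[0][input[i]];
        ++partial[1][input[i + 1]];
        ++partial[2][input[i + 2]];
        ++partial[3][input[i + 3]];
    }
    /* Tail: leftover bytes when n is not a multiple of NUM_TABLES. */
    for (; i < n; i++) {
        ++partial[0][input[i]];
    }
    /* Merge the partial tables into the caller's table. */
    for (size_t v = 0; v < 256; v++) {
        count[v] += partial[0][v] + partial[1][v] + partial[2][v] + partial[3][v];
    }
}
```

Whether this beats the plain loop depends on the data (it helps most when the input has long runs of equal bytes) and on the CPU, so it needs to be benchmarked.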

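f) optimization 3: multi-threading, but if the threads share one "count" array, each increment must be atomic (on x86, a LOCK-prefixed instruction), which gets expensive when threads contend for the same counters;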

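g) optimization 4: vector extension CPU instructions (e.g. AVX-512): a "gather" instruction loads sparse locations (count[xxx]) into a zmm register (512 bits, 64 bytes, i.e. 16 32-bit integers), so 16 input uint8_t values can be processed in one go; then add a constant 512-bit vector which adds 1 to each integer; the corresponding "scatter" instruction stores the updated counts back. One caveat: if two of the 16 input bytes are equal, both lanes gather the same old value and the scatter keeps only one increment, so duplicates within a vector need separate handling (AVX-512CD provides the conflict-detection instruction VPCONFLICTD for this case).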
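
For the multi-threading idea in f), one way to avoid atomic increments altogether is to give every thread its own private table and merge the tables after the threads finish. This is a sketch under assumptions, not something from the original post: it uses POSIX threads, a fixed `NUM_THREADS` of 4, and the hypothetical names `struct slice`, `count_slice` and `count_it_parallel`. Each private table is aligned to a 64-byte cache line so that two threads never write to the same line.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_THREADS 4  /* assumption: tune to the number of cores */

/* Per-thread work item: a slice of the input plus a private table. */
struct slice {
    const uint8_t *data;
    size_t len;
    _Alignas(64) uint32_t local[256];  /* own cache lines, no false sharing */
};

static void *count_slice(void *arg)
{
    struct slice *s = arg;
    for (size_t i = 0; i < s->len; i++) {
        ++s->local[s->data[i]];  /* private table, so no atomics needed */
    }
    return NULL;
}

/* Hypothetical helper: adds the histogram of `n` bytes into count[256].
 * Returns 0 on success, -1 if a thread could not be started. */
int count_it_parallel(const uint8_t *input, size_t n, uint32_t count[256])
{
    struct slice slices[NUM_THREADS];
    pthread_t threads[NUM_THREADS];
    size_t chunk = n / NUM_THREADS;
    int started = 0;
    int rc = 0;

    for (int t = 0; t < NUM_THREADS; t++) {
        slices[t].data = input + (size_t)t * chunk;
        /* The last thread also takes the remainder. */
        slices[t].len = (t == NUM_THREADS - 1) ? n - (size_t)t * chunk : chunk;
        memset(slices[t].local, 0, sizeof(slices[t].local));
        if (pthread_create(&threads[t], NULL, count_slice, &slices[t]) != 0) {
            rc = -1;
            break;
        }
        started++;
    }
    /* Wait for every worker that was actually started. */
    for (int t = 0; t < started; t++) {
        pthread_join(threads[t], NULL);
    }
    if (rc != 0) {
        return rc;
    }
    /* Merge the private tables on the calling thread. */
    for (int t = 0; t < NUM_THREADS; t++) {
        for (size_t v = 0; v < 256; v++) {
            count[v] += slices[t].local[v];
        }
    }
    return 0;
}
```

For an input as small as 100 KB the cost of creating and joining threads may well outweigh the gain, so this only makes sense to try on much larger inputs or with a pre-existing thread pool.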
