Why doesn't a CUDA synchronization point prevent a race condition?
We ran cuda-memcheck --tool racecheck <executable> on our code and got the following memory hazard errors:
    ========= Race reported between Read access at 0x00004098 in cuda.cu:123:kernelfunction()
    =========     and Write access at 0x00005058 in cuda.cu:146:kernelfunction() [529996 hazards]
The code is below. The tool claims that line 123,

    value = sharedmemory0[sharedmemoryindex];

is in a race condition with line 146,

    sharedmemory0[sharedindex0] = sharedmemory1[sharedindex1];

but we have

    // synchronization point 1
    __syncthreads();
    __threadfence_block();

between the 2 lines. Shouldn't all threads synchronize at that point, and shouldn't all previous memory reads/writes be complete at that point? All threads and memory accesses should complete after the first j-loop, before the second j-loop starts. In our minds, synchronization point 1 should isolate the 2 j-loops and prevent a race condition, but the tool says that's not true.
Why is the tool reporting a race condition? Any insights on how to prevent it?

We've seen references to the tool being able to report a trace of the execution, to make it easier to see the race condition. What tool and options can we use to get a trace and see more clearly why the race condition exists?
    for (i = 0; i < count0; i++)
    {
        // synchronization point 0
        __syncthreads();
        __threadfence_block();
        for (j = 0; j < count1; j++)
        {
            index = j*blockDim.x + threadIdx.x;
            if (index < thread_count0)
            {
                for (k = 0; k < count2; k++)
                {
                    sharedmemoryindex = function0(index);
                    value = sharedmemory0[sharedmemoryindex];
                }
            }
        }
        // synchronization point 1
        __syncthreads();
        __threadfence_block();
        for (j = 0; j < count2; j++)
        {
            index = j*blockDim.x + threadIdx.x;
            if (index < thread_count1)
            {
                sharedindex0 = function1(index);
                sharedindex1 = function2(index);
                sharedmemory0[sharedindex0] = sharedmemory1[sharedindex1];
            }
        }
    }
We've also run the synccheck tool, cuda-memcheck --tool synccheck <executable>, and it reported the following error on synchronization point 1. There's a strong correlation between the 2 errors, but there isn't much documentation in the cuda-memcheck guide about what synchronization of divergent code is, why it's bad, and how to fix it.

Any comments?
    ========= Barrier error detected. Encountered barrier with divergent threads in block
    =========     at 0x00004ad8 in cuda.cu:139:kernelfunction()
    =========     by thread (0,0,0) in block (8,0,0)
There is not enough data to pinpoint the problem accurately. However, the last error message is crucial:

    Barrier error detected. Encountered barrier with divergent threads in block

It seems that one of the threads in the block reaches a barrier while another one does not, because it is in a branch that was not taken. Note that divergent branches occur not only in if conditions, but also in loops, if the looping condition differs between threads in the block.
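To make the loop case concrete, here is a minimal sketch, assuming a single 256-thread block. The names divergentLoop, uniformLoop, runUniformLoop, kBlock and n are hypothetical and not from your code. In divergentLoop the loop bound depends on threadIdx.x, so when n is not a multiple of blockDim.x some threads leave the loop early and skip barriers that the rest of the block still executes. uniformLoop has the same shape as your j-loops: every thread runs the same number of iterations, and the bounds check guards only the memory accesses, never the __syncthreads().

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

constexpr int kBlock = 256;   // threads per block; must match the shared buffer

// BROKEN (never launched here): the loop bound depends on threadIdx.x, so
// when n is not a multiple of blockDim.x some threads exit early and skip
// __syncthreads() calls that the rest of the block still executes.
__global__ void divergentLoop(const int* in, int* out, int n)
{
    __shared__ int buf[kBlock];
    for (int idx = threadIdx.x; idx < n; idx += blockDim.x) {
        buf[threadIdx.x] = in[idx];
        __syncthreads();                       // divergent barrier
        out[idx] = buf[(threadIdx.x + 1) % blockDim.x];
        __syncthreads();                       // divergent barrier
    }
}

// FIXED: every thread runs the same number of iterations; the bounds check
// guards only the memory accesses, so each barrier is reached block-wide.
__global__ void uniformLoop(const int* in, int* out, int n)
{
    __shared__ int buf[kBlock];
    const int tid = threadIdx.x;
    const int bd = blockDim.x;
    const int iterations = (n + bd - 1) / bd;  // identical in all threads
    for (int j = 0; j < iterations; j++) {
        int idx = j * bd + tid;
        buf[tid] = (idx < n) ? in[idx] : 0;    // pad out-of-range slots with 0
        __syncthreads();                       // all of buf is written
        if (idx < n) {
            out[idx] = buf[(tid + 1) % bd];    // read the neighbour's slot
        }
        __syncthreads();                       // all reads done before buf is reused
    }
}

// Host helper: fills in[i] = i, runs uniformLoop on one block, returns out.
std::vector<int> runUniformLoop(int n)
{
    assert(n > 0);
    std::vector<int> h_in(n), h_out(n, -1);
    for (int i = 0; i < n; i++) h_in[i] = i;
    int* d_in = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    uniformLoop<<<1, kBlock>>>(d_in, d_out, n);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

With n = 300, uniformLoop runs two iterations in every thread even though only 44 threads have an element in the second one; that idle iteration is the price of keeping the barrier uniform.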
When some threads miss a __syncthreads() because of this, strange things may happen. In practice, it means that threads stop at different __syncthreads() calls, and the system thinks they are in sync when they are not. This can later lead to the racing situations you describe.
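A minimal sketch of threads stopping at different __syncthreads() calls, using hypothetical kernels splitBarrier and singleBarrier on one 64-thread block. The CUDA C++ Programming Guide allows __syncthreads() in conditional code only when the condition evaluates identically across the entire thread block; otherwise execution is likely to hang or produce unintended side effects. splitBarrier breaks that rule with one barrier in each branch of a non-uniform if/else; singleBarrier keeps the branch for the work only and places a single barrier after it.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

constexpr int kBlock = 64;   // threads per block; must match the shared buffer

// BROKEN (never launched here): the two halves of the block wait at two
// different __syncthreads() calls. The condition is not uniform across the
// block, so the programming guide gives no guarantee about what happens.
__global__ void splitBarrier(int* out)
{
    __shared__ int s[kBlock];
    if (threadIdx.x < kBlock / 2) {
        s[threadIdx.x] = threadIdx.x;
        __syncthreads();                       // barrier A
    } else {
        s[threadIdx.x] = -1;
        __syncthreads();                       // barrier B, a different call
    }
    out[threadIdx.x] = s[kBlock - 1 - threadIdx.x];
}

// FIXED: branch only around the work, then one barrier that every thread hits.
__global__ void singleBarrier(int* out)
{
    __shared__ int s[kBlock];
    const int tid = threadIdx.x;
    if (tid < kBlock / 2) {
        s[tid] = tid;
    } else {
        s[tid] = -1;
    }
    __syncthreads();                           // every write to s is visible
    out[tid] = s[kBlock - 1 - tid];            // read the mirrored slot
}

// Host helper: runs singleBarrier on one block and returns its output.
std::vector<int> runSingleBarrier()
{
    std::vector<int> h_out(kBlock, 0);
    int* d_out = nullptr;
    cudaMalloc(&d_out, kBlock * sizeof(int));
    singleBarrier<<<1, kBlock>>>(d_out);
    cudaMemcpy(h_out.data(), d_out, kBlock * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    assert(h_out.size() == static_cast<size_t>(kBlock));
    return h_out;
}
```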
So, find the divergent __syncthreads(): that's the cause of your problems. It may well be somewhere before the snippet you included.
Also:

- Is i a local variable (not shared)?
- Is count0 the same for all threads in the block?
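If count0 can differ between threads, one way to make it uniform is to loop to the block-wide maximum and guard the per-thread work inside the loop. A minimal sketch, assuming a hypothetical per-thread count workFor(tid) and one 128-thread block; the kernel and helper names are illustrative only. Here i is an ordinary local variable, so each thread has its own copy; if it were declared __shared__, every thread would update the same counter and the loop could end at different times in different threads.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

constexpr int kBlock = 128;   // threads per block used by the host helper

// Hypothetical per-thread amount of work, for illustration only.
__device__ int workFor(int tid) { return tid % 5; }

// BROKEN (never launched here): count0 differs per thread, so threads with a
// smaller count0 leave the loop and skip the remaining barriers.
__global__ void perThreadCount(int* out)
{
    int count0 = workFor(threadIdx.x);
    int acc = 0;
    for (int i = 0; i < count0; i++) {
        __syncthreads();                         // divergent barrier
        acc += 1;
    }
    out[threadIdx.x] = acc;
}

// FIXED: every thread loops to the block-wide maximum; only the work is
// guarded by the thread's own count. i is a per-thread local variable.
__global__ void uniformCount(int* out)
{
    __shared__ int count0;                       // one value for the whole block
    const int tid = threadIdx.x;
    const int myCount = workFor(tid);            // differs between threads
    if (tid == 0) count0 = 0;
    __syncthreads();                             // count0 initialised
    atomicMax(&count0, myCount);
    __syncthreads();                             // count0 final for every thread
    int acc = 0;
    for (int i = 0; i < count0; i++) {           // same bound in all threads
        __syncthreads();                         // reached by every thread
        if (i < myCount) acc += 1;
    }
    out[tid] = acc;
}

// Host helper: runs uniformCount on one block and returns its output.
std::vector<int> runUniformCount()
{
    std::vector<int> h_out(kBlock, -1);
    int* d_out = nullptr;
    cudaMalloc(&d_out, kBlock * sizeof(int));
    uniformCount<<<1, kBlock>>>(d_out);
    cudaMemcpy(h_out.data(), d_out, kBlock * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    assert(h_out.size() == static_cast<size_t>(kBlock));
    return h_out;
}
```

atomicMax on a shared variable is just one option; if count0 can be computed from kernel arguments or blockIdx alone, it is already uniform and needs no extra barriers.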