Although CUDA kernel launches are asynchronous, all GPU-related tasks placed in one stream (which is the default behaviour) are executed sequentially.
So, for example,
kernel1<<<X,Y>>>(...); // kernel starts execution, CPU continues to next statement
kernel2<<<X,Y>>>(...); // kernel is placed in queue and will start after kernel1 finishes, CPU continues to next statement
cudaMemcpy(...); // CPU blocks until memory is copied; memory copy starts only after all preceding CUDA calls finish
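A self-contained version of the same idea (the kernel bodies, names, and sizes below are invented purely for illustration):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel1(int *d) { d[threadIdx.x] = threadIdx.x; } // hypothetical kernel
__global__ void kernel2(int *d) { d[threadIdx.x] *= 2; }          // hypothetical kernel

int main() {
    const int N = 128;
    int h[N];
    int *d;
    cudaMalloc((void**)&d, N * sizeof(int));

    kernel1<<<1, N>>>(d); // asynchronous launch, CPU continues immediately
    kernel2<<<1, N>>>(d); // queued in the default stream, runs after kernel1

    // Blocking copy: starts only after kernel2 finishes, and the CPU waits for
    // its completion, so no explicit synchronization is needed before using h.
    cudaMemcpy(h, d, N * sizeof(int), cudaMemcpyDeviceToHost);

    printf("h[5] = %d\n", h[5]); // prints 10: kernel1 wrote 5, kernel2 doubled it
    cudaFree(d);
    return 0;
}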
cudaDeviceSynchronize(), cudaThreadSynchronize(), and cudaStreamSynchronize() are all barriers. Barriers prevent code execution beyond the barrier until some condition is met.
cudaDeviceSynchronize() blocks the calling CPU/host thread until the GPU has finished processing all previously requested CUDA tasks (kernels, memory copies, etc.).
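For example, a minimal sketch of the typical pattern (the kernel and launch configuration are made up for illustration); the barrier is also a convenient point to check for kernel errors:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel1(int *data) { data[threadIdx.x] *= 2; } // hypothetical kernel

int main() {
    int *d_data;
    cudaMalloc((void**)&d_data, 256 * sizeof(int));
    cudaMemset(d_data, 0, 256 * sizeof(int));

    kernel1<<<1, 256>>>(d_data); // asynchronous: control returns to the CPU immediately

    cudaDeviceSynchronize();     // barrier: CPU blocks here until the kernel has finished

    // After the barrier, any execution error from the kernel is visible
    printf("kernel status: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(d_data);
    return 0;
}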
cudaThreadSynchronize() is just a deprecated version of cudaDeviceSynchronize(). Deprecated just means that it still works for now, but it's recommended not to use it (use cudaDeviceSynchronize() instead), and in the future it may become unsupported. But cudaThreadSynchronize() and cudaDeviceSynchronize() are basically identical.

cudaStreamSynchronize() takes a stream id as its only parameter and blocks the host thread until the GPU has finished all previously issued CUDA tasks in that stream. CUDA tasks issued in other streams may or may not be complete when CPU code execution continues beyond this barrier.
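A sketch of the difference (kernel names and sizes are invented for illustration): cudaStreamSynchronize() waits on one stream only, while cudaDeviceSynchronize() waits on everything:

#include <cuda_runtime.h>

__global__ void kernelA(float *x) { x[threadIdx.x] += 1.0f; } // hypothetical kernel
__global__ void kernelB(float *y) { y[threadIdx.x] -= 1.0f; } // hypothetical kernel

int main() {
    float *d_x, *d_y;
    cudaMalloc((void**)&d_x, 64 * sizeof(float));
    cudaMalloc((void**)&d_y, 64 * sizeof(float));
    cudaMemset(d_x, 0, 64 * sizeof(float));
    cudaMemset(d_y, 0, 64 * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    kernelA<<<1, 64, 0, s1>>>(d_x); // issued in stream s1
    kernelB<<<1, 64, 0, s2>>>(d_y); // issued in stream s2, may run concurrently with kernelA

    cudaStreamSynchronize(s1); // CPU blocks until all work in s1 is done;
                               // kernelB in s2 may still be running at this point

    cudaDeviceSynchronize();   // waits for everything on the device, including s2

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}

Without the final cudaDeviceSynchronize(), the CPU could reach the cleanup code while kernelB is still running in s2.

Source: http://blog.csdn.net/mathgeophysics/article/details/19905935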