InterviewSolution
1. Why must CUDA divide computation twice: into grids and then blocks?
Answer» The hardware is designed to maximize throughput. It does this by keeping a large number of threads running at once, all with a live context. This implies that only a fixed number of threads can fit in the hardware at any time, which in turn means those threads cannot communicate with, or depend on, threads that did not fit, since the latter must wait for the first set of threads to complete execution. Hence the two-level decomposition: a grid of independently scheduled blocks, where only the threads within a block are guaranteed to be co-resident and able to cooperate. Further, even the threads running together may execute on different SMs, and synchronization across SMs would be slow and onerous, so it is not supported.
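The two-level decomposition described above is visible directly in the launch syntax. The sketch below is a minimal, hypothetical example (the kernel name `scale` and the sizes are illustrative, not from the source): the global thread index combines the block index and the thread index, threads may synchronize within a block, but there is no barrier across blocks.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Hypothetical kernel illustrating the two-level decomposition:
// the grid is split into blocks, and each block into threads.
__global__ void scale(float *data, float factor, int n)
{
    // The global index combines both levels: which block, then
    // which thread within that block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;

    // __syncthreads() can synchronize threads WITHIN this block,
    // because a block always runs whole on a single SM with a live
    // context. There is no equivalent barrier across blocks: another
    // block may not even be resident on the hardware yet.
}

int main(void)
{
    const int n = 1 << 20;   // illustrative problem size
    float *d = NULL;
    cudaMalloc(&d, n * sizeof(float));

    // Two-level launch configuration: the grid has 'blocks' blocks,
    // each of 256 threads. Blocks are scheduled independently, in any
    // order, on any SM; this is what lets the hardware run a problem
    // larger than the number of threads that fit at once.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(d, 2.0f, n);

    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```

Because each block must be able to run to completion on whatever SM receives it, blocks cannot wait on one another; only the threads inside a block, which share an SM, get a synchronization primitive.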