InterviewSolution
This section includes InterviewSolutions, each offering curated interview questions and answers to sharpen your knowledge and support exam preparation. Choose a topic below to get started.
| 1. |
How Long Does The Parallel Version Of Prim’s Minimum Spanning Tree Algorithm Require For A Graph With N Nodes Using P Processors? |
|
Answer» O(n²/p + n log p) |
|
| 2. |
How Can One Ensure Mutual Exclusion Without Locks? |
|
Answer» When references of two (or more) threads (or processes) to a variable must be serialized, system primitives like compare-and-swap can detect the conflict with another thread. Lock-free implementations detect the conflict atomically (e.g., using compare-and-swap); one thread succeeds while the other backs off and retries. |
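
The read–CAS–retry loop described above can be sketched in Python. Note that `AtomicCell` and `compare_and_swap` here are illustrative names, and the internal lock is only a toy stand-in for what is a single atomic hardware instruction on real machines:

```python
import threading

class AtomicCell:
    """Toy model of a memory word supporting compare-and-swap.
    (Real CAS is one hardware instruction; the internal lock here
    only simulates that atomicity for demonstration.)"""
    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()  # stands in for hardware atomicity

    def load(self):
        with self._guard:
            return self._value

    def compare_and_swap(self, expected, new):
        """Atomically set value to `new` iff it still equals `expected`.
        Returns True on success, False if another thread raced ahead."""
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False

def lock_free_increment(cell, times):
    """Increment without holding a lock across the update:
    read, attempt CAS, and retry on conflict."""
    for _ in range(times):
        while True:
            old = cell.load()
            if cell.compare_and_swap(old, old + 1):
                break  # our update won; otherwise back off and retry

counter = AtomicCell(0)
threads = [threading.Thread(target=lock_free_increment, args=(counter, 1000))
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter.load())  # 4000 — no increment is lost despite the race
```

The key point is that no thread ever blocks holding a lock over the shared update; a losing thread simply observes that the CAS failed and retries with a fresh value.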
|
| 3. |
When Stealing Load From A Random Loaded Processor, What Type Of Synchronization Is Needed? |
|
Answer» One needs to make sure that the queue being stolen from is operated on in a synchronized fashion – either locked or edited in a lock-free manner. |
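
A minimal sketch of the locked variant, in Python. The `WorkQueue` class and its method names are hypothetical, chosen for illustration; a production work-stealing deque (e.g., Chase–Lev) would use lock-free operations instead of this lock:

```python
import threading
from collections import deque

class WorkQueue:
    """Per-processor task deque: the owner pops from one end, thieves
    steal from the other. A lock keeps pop and steal from racing."""
    def __init__(self, tasks):
        self._tasks = deque(tasks)
        self._lock = threading.Lock()

    def pop_local(self):
        # Owner takes work from the back of its own queue.
        with self._lock:
            return self._tasks.pop() if self._tasks else None

    def steal(self):
        # A thief must synchronize on the victim's queue before removing work.
        with self._lock:
            return self._tasks.popleft() if self._tasks else None

victim = WorkQueue(range(6))
print(victim.pop_local())  # owner takes 5 from the back
print(victim.steal())      # thief takes 0 from the front
```

Taking from opposite ends reduces contention between the owner and thieves, but the synchronization on the victim's queue is still mandatory.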
|
| 4. |
How Fast Can A List Be Sorted Using P Processors Using Local Sorting Of N/p Elements Each Followed By Optimal Multi-way Merge? |
|
Answer» O((n/p) log n) |
|
| 5. |
How Fast Can Two Sorted Lists Of Size N Each Be Merged Into One Using P Processors? |
|
Answer» O(n/p) time using optimal multi-way merge. |
|
| 6. |
In Order To Balance Load For Parallel Bucket Sort Of N Elements, Uniformly Spaced Splitters Need To Be Selected. This Can Be Done By First Dividing The List Into B Lists And Choosing B Equi-spaced Samples From Each. The Final B Splitters Are Chosen Uniformly Spaced From These Samples. How Balanced Are The Buckets If These Splitters Are Used? |
|
Answer» No bucket will contain more than 2n/B elements. |
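
The sampling scheme can be sketched in Python and the bound checked empirically. The function name and the exact index arithmetic are illustrative assumptions, not from any library:

```python
import bisect
import random

def choose_splitters(data, B):
    """Sample-based splitter selection: split the list into B sublists,
    take B equi-spaced samples from each (after sorting the sublist),
    then pick B-1 equi-spaced splitters from the sorted sample set."""
    n = len(data)
    chunk = n // B
    samples = []
    for i in range(B):
        part = sorted(data[i*chunk:(i+1)*chunk])
        step = max(1, len(part) // B)
        samples.extend(part[::step][:B])
    samples.sort()
    step = max(1, len(samples) // B)
    return samples[step::step][:B-1]   # B-1 splitters define B buckets

random.seed(1)
data = [random.randrange(10**6) for _ in range(10000)]
B = 10
splitters = choose_splitters(data, B)

# Count how many elements land between consecutive splitters.
buckets = [0] * B
for x in data:
    buckets[bisect.bisect_left(splitters, x)] += 1
print(max(buckets) <= 2 * len(data) // B)  # the 2n/B guarantee holds
```

Because the samples are themselves equi-spaced within sorted sublists, no bucket boundary can skip more than 2n/B elements, which is what makes the subsequent parallel bucket sort load-balanced.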
|
| 7. |
How Long Does Batcher’s Odd-even Merge Require? |
|
Answer» O(log n) time, O(n log n) work. |
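
A sequential sketch of the recursive odd-even merge (assuming both inputs are sorted and of equal power-of-two length). Each recursion level's compare-exchanges are independent, which is where the O(log n) parallel time comes from:

```python
def odd_even_merge(a, b):
    """Batcher's odd-even merge of two sorted lists: recursively merge
    the even-indexed and odd-indexed subsequences, interleave them, and
    fix up with one pass of neighbor compare-exchanges. All the
    compare-exchanges at one level can run in parallel."""
    if len(a) == 1 and len(b) == 1:
        return [min(a[0], b[0]), max(a[0], b[0])]
    even = odd_even_merge(a[::2], b[::2])   # merge the even subsequences
    odd = odd_even_merge(a[1::2], b[1::2])  # merge the odd subsequences
    merged = [None] * (len(a) + len(b))
    merged[::2] = even
    merged[1::2] = odd
    # One parallel round of compare-exchange between interior neighbors.
    for i in range(1, len(merged) - 1, 2):
        if merged[i] > merged[i + 1]:
            merged[i], merged[i + 1] = merged[i + 1], merged[i]
    return merged

print(odd_even_merge([1, 4, 6, 9], [2, 3, 7, 8]))
# [1, 2, 3, 4, 6, 7, 8, 9]
```

With n total elements there are O(log n) recursion levels and O(n) compare-exchanges per level, giving the O(n log n) work bound quoted above.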
|
| 8. |
How Long Does Bitonic Sorting Require On PRAM? |
|
Answer» O(log² n) |
|
| 9. |
How Can Prefix Minima Be Found In O(1) Time? |
|
Answer» This can be computed by first finding all nearest smaller values in O(1) time, and then finding, in O(1) time for each element (using O(n) processors per element), the largest index smaller than its own whose element has no nearest smaller value on its left. The work complexity of O(n²) can be improved using accelerated cascading. |
|
| 10. |
How Can Two GPU Threads Communicate Through Shared Memory? |
|
Answer» If the threads belong to a non-divergent warp, writes before reads are visible to the read. Two threads in the same block must have an intervening sync for the write to affect the read. Two threads in different blocks within the same kernel cannot be guaranteed an order, and the read must be moved to a later kernel for the write to become visible. |
|
| 11. |
How Do Memory Operations In GPUs Differ From Those In CPUs? |
|
Answer» GPUs have a significantly smaller cache, making the average latency of memory operations much higher. This requires many concurrent threads to hide the latency. Also, the shared memory can be used as an explicitly managed cache under direct control of the programmer – making it possible to utilize the cache better in some situations. Further, because of SIMD warp instructions, multiple memory accesses are made per instruction. These accesses can be coalesced into a smaller number of real accesses if the address set is contiguous for global memory or strided for shared memory. |
|
| 12. |
Why Must CUDA Divide Computation Twice: Into Grids And Then Blocks? |
|
Answer» The hardware is designed to maximize throughput by allowing a large number of running threads, all with a live context. This implies that only a fixed number of threads can fit in the hardware. This in turn means that these threads cannot communicate with, or depend on, other threads that could not fit, and hence must wait for the first set of threads to complete execution. Hence the two-level decomposition. Further, even the set of threads running together may execute at different SMs, and synchronization across SMs would be slow and onerous and hence is not supported. |
|
| 13. |
What Is Accelerated Cascading? |
|
Answer» The accelerated cascading technique combines a fast but work-inefficient algorithm with a work-optimal one. The problem is recursively divided into many smaller sub-problems, which are first solved using the optimal algorithm. The sub-results are then combined with the faster version of the algorithm. |
|
| 14. |
What Is The Time Complexity Of The Optimal Merge Algorithm (On PRAM)? |
|
Answer» O(log log n), by first merging sub-sequences of the original lists of size n/(log log n) each. The remaining elements are inserted into the just-computed sequence in the next step. |
|
| 15. |
What Is The Complexity Of Prefix Sum In The PRAM Model? |
|
Answer» Time O(log n) and work O(n) |
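
The O(log n)-step idea can be simulated sequentially in Python. This sketch uses the Hillis–Steele doubling scan, which takes O(log n) parallel steps but O(n log n) work; the O(n)-work bound quoted above needs the tree-based up-sweep/down-sweep variant instead:

```python
def prefix_sum_pram(a):
    """Doubling scan: in step k, every element adds the value 2^k
    positions to its left. All additions within a step touch distinct
    outputs, so on a PRAM each step runs in O(1) time, and there are
    ceil(log2 n) steps in total."""
    a = list(a)
    n, step = len(a), 1
    while step < n:
        # One parallel round: each position reads the old array only.
        a = [a[i] + (a[i - step] if i >= step else 0) for i in range(n)]
        step *= 2
    return a

print(prefix_sum_pram([3, 1, 4, 1, 5]))  # [3, 4, 8, 9, 14]
```

Writing each round as a fresh list mirrors the PRAM requirement that all processors read before any writes become visible in the next step.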
|
| 16. |
How Can MPI Be Used For Shared-memory Style Programming? |
|
Answer» Each process registers its local memory and attaches it to a "window." Accesses via this window get translated to send or fetch requests to the desired member of the group. The pairing communication is handled by the MPI system asynchronously. |
|
| 17. |
What Is A Collective Communication Call? |
|
Answer» It's a call that must be made at all members of the communication group. No call can return until all the calls have at least been made. |
|
| 18. |
When Can An MPI Send Call Return? |
|
Answer» If it is a synchronous call, it can return only when the pairing call on another process is ready. For asynchronous versions, it can return as soon as the provided buffer is ready for re-use. |
|
| 19. |
What Is A Task Dependency Graph? |
|
Answer» A directed graph with nodes representing tasks, and an edge from task a to task b indicating that task b can only start after task a is completed. |
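
Such a graph directly yields a valid execution order via a topological sort. A sketch using Kahn's algorithm, with hypothetical task names chosen for illustration:

```python
from collections import defaultdict, deque

def schedulable_order(tasks, edges):
    """Kahn's algorithm: repeatedly release every task whose
    prerequisites have all completed. Tasks released in the same round
    are independent and could execute in parallel. `edges` holds pairs
    (a, b) meaning task b may start only after task a completes."""
    indeg = {t: 0 for t in tasks}
    succ = defaultdict(list)
    for a, b in edges:
        succ[a].append(b)
        indeg[b] += 1
    ready = deque(t for t in tasks if indeg[t] == 0)
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)
        for nxt in succ[t]:
            indeg[nxt] -= 1
            if indeg[nxt] == 0:
                ready.append(nxt)
    return order  # shorter than `tasks` would indicate a dependency cycle

print(schedulable_order(["load", "parse", "index", "report"],
                        [("load", "parse"), ("parse", "index"),
                         ("parse", "report")]))
```

Here "index" and "report" both depend only on "parse", so a scheduler could run them in parallel once "parse" finishes.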
|
| 20. |
What Is False Sharing? |
|
Answer» The sharing of a cache line by distinct variables, which creates performance issues. If such variables are not accessed together, the un-accessed variable is unnecessarily brought into the cache along with the accessed variable. |
|
| 21. |
What Is The Difference Between Processor And FIFO Consistency? |
|
Answer» In FIFO consistency, only writes from a single processor are visible in the order issued. In processor consistency, there additionally exists a global ordering of writes to any address x by different processes that is consistent with the local views. |
|
| 22. |
How Does OpenMP Provide A Shared-memory Programming Environment? |
|
Answer» OpenMP uses pragmas to control the automatic creation of threads. Since the threads share the address space, they share memory. However, they are allowed a local view of the shared variables through “private” variables. The compiler allocates a variable-copy for each thread and optionally initializes them with the original variable. Within the thread, references to the private variable are statically changed to the new variables. |
|
| 23. |
What Is The Diameter Of An N-node Hypercube? |
|
Answer» log n. The diameter is the minimum number of links required to travel between the two furthest nodes. |
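
This is easy to verify computationally: hypercube nodes can be labeled with bit strings so that neighbors differ in exactly one bit, making the shortest-path length between two nodes their Hamming distance. A small check for the 4-dimensional case:

```python
def hypercube_distance(u, v):
    """Shortest path between hypercube nodes u and v equals the Hamming
    distance of their labels: each hop flips exactly one bit."""
    return bin(u ^ v).count("1")

d = 4                  # 4-dimensional hypercube: N = 2^4 = 16 nodes
N = 2 ** d
diameter = max(hypercube_distance(u, v)
               for u in range(N) for v in range(N))
print(diameter)  # 4 == log2(16): opposite corners differ in every bit
```

The maximum is attained by complementary labels (e.g., 0000 and 1111), which need all d = log₂ N bits flipped.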
|
| 24. |
What Is Cache Coherence? |
|
Answer» Different processors may maintain their own local caches. This results in potentially multiple copies of the same data. Coherence implies that accesses to these copies behave as if there were only a single copy – apart from the time to access. |
|
| 25. |
What Is The Maximum Time Speed-up Possible According To Amdahl's Law? |
|
Answer» 1/f, where f is the inherently sequential fraction of the time taken by the best sequential execution of the task. |
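
The bound follows from the speed-up formula S(p) = 1 / (f + (1 − f)/p): as p grows, the parallel term vanishes and S approaches 1/f. A quick numeric sketch with a hypothetical sequential fraction of 10%:

```python
def amdahl_speedup(f, p):
    """Amdahl's law: speed-up on p processors when a fraction f of the
    sequential running time cannot be parallelized."""
    return 1.0 / (f + (1.0 - f) / p)

f = 0.1  # hypothetical task: 10% inherently sequential
for p in (1, 10, 100, 10**6):
    print(p, round(amdahl_speedup(f, p), 2))
# Even with an unlimited number of processors, speed-up stays below 1/f = 10.
```

Note how quickly the returns diminish: going from 100 to a million processors barely moves the speed-up, because the sequential fraction dominates.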
|
| 26. |
What Is Speed-up? |
|
Answer» The ratio of some performance metric (like latency) obtained using a single processor to that obtained using a set of parallel processors. |
|
| 27. |
What Is Task-throughput? |
|
Answer» The number of tasks completed in a given time. |
|
| 28. |
What Is Task-latency? |
|
Answer» The time taken for a task to complete since a request for it is made. |
|
| 29. |
What Is Task-parallel Computation? |
|
Answer» The parallelism manifests across functions. A set of functions needs to be computed, which may or may not have order constraints among them. |
|
| 30. |
What Is Data-parallel Computation? |
|
Answer» Data is partitioned across parallel execution threads, each of which performs some computation on its partition – usually independent of the other threads. |
|