DescriptionMassive parallel processing power of GPUs has attracted researchers to develop iterative vertex-centric graph processing frameworks for GPUs. Enabling work-efficiency in these solutions, however, is not straightforward and comes at the cost of SIMD-inefficiency and load imbalance. This paper offers techniques that overcome these challenges when processing the graph on a GPUs. For a SIMD-efficient kernel operation involving gathering of neighbors and performing reduction on them, we employ an effective task expansion strategy that avoids intra-warp thread underutilization. As recording vertex activeness requires additional data structures, to attenuate the graph storage overhead on limited GPU DRAM, we introduce vertex grouping as a technique that enables trade-off between memory consumption and the work efficiency in our solution. Our experiments show that these techniques provide up to 5.46x over the recently proposed WS-VR framework over multiple algorithms and inputs.