Efficiently mapping algorithms onto parallel architectures is of utmost importance, since many state-of-the-art embedded digital systems must exploit parallelism to increase their computational power. This talk deals with the mapping of nested loop programs onto massively parallel processor arrays. We present a unified design methodology for achieving highly parallel implementations on two kinds of architectures: (a) dedicated, application-specific arrays and (b) coarse-grained, "weakly programmable" processor arrays. We describe which steps of the design flow are common to both architecture types. The synthesis of dedicated hardware accelerators is mostly automated, and only relatively few architectural constraints have to be considered. By contrast, when targeting coarse-grained processor arrays, a large number of architectural parameters have to be taken into account during backend code generation. The proposed unified, retargetable design methodology is applied in several case studies, and the implementations for both target architectures are evaluated with respect to performance, area cost, and reconfiguration time. The results show that both approaches have their specific benefits and drawbacks.