Details

Title

Execution time prediction model for parallel GPU realizations of discrete transforms computation algorithms

Journal title

Bulletin of the Polish Academy of Sciences Technical Sciences

Yearbook

2022

Volume

70

Issue

1

Affiliation

Puchala, Dariusz : Institute of Information Technology, Łódź University of Technology, ul. Wólczańska 215, 90-924 Łódź, Poland ; Stokfiszewski, Kamil : Institute of Information Technology, Łódź University of Technology, ul. Wólczańska 215, 90-924 Łódź, Poland ; Wieloch, Kamil : Institute of Information Technology, Łódź University of Technology, ul. Wólczańska 215, 90-924 Łódź, Poland

Authors

Keywords

graphics processing unit (GPU) ; execution time prediction model ; discrete wavelet transform (DWT) ; lattice structure ; convolution-based approach ; orthogonal transform ; orthogonal filter banks ; time effectiveness; prediction accuracy

Divisions of PAS

Nauki Techniczne

Coverage

e139393

Bibliography

  1.  U.N. Ahmed and K.R. Rao, Orthogonal Transforms for Digital Signal Process. Secaucus, NJ, USA: Springer-Verlag, New York, Inc., 1974.
  2.  Y. Su and Z. Xu, “Parallel implementation of wavelet-based image denoising on programmable pc-grade graphics hardware,” Signal Process., vol. 90, pp. 2396–2411, 2010, doi: 10.1016/j.sigpro.2009.06.019.
  3.  P. Lipinski and D. Puchala, “Digital image watermarking using fast parametric transforms,” Bull. Pol. Acad. Sci. Tech. Sci., vol. 67, pp. 463–477, 2019.
  4.  K.R. Rao and P. Yip, Discrete cosine transform: algorithms, advantages, applications. San Diego, CA, USA: Academic Press Professional, Inc., 1990.
  5.  D. Salomon, A Guide to Data Compression Methods. New York: Springer-Verlag
  6. D. Puchala and M. Yatsymirskyy, “Joint compression and encryption of visual data using orthogonal parametric transforms,” Bull. Pol. Acad. Sci. Tech. Sci., vol. 64, pp. 373–382, 2016.
  7.  M. Akay, Time Frequency and Wavelets in Biomedical Signal Process., ser. IEEE Press Series in Biomed. Eng. Wiley-IEEE Press, 1998.
  8.  S. Babichev, J. Skvor, J. Fiser, and V. Lytvynenko, “Technology of gene expression profiles filtering based on wavelet analysis,” Int. J. Intell. Syst. Appl., vol. 10, pp. 1–7, 2018.
  9.  Z. Jakovljevic, R. Puzovic, and M. Pajic, “Recognition of planar segments in point cloud based on wavelet transform,” IEEE Trans. Ind. Inf., vol. 11, no. 2, pp. 342–352, 2015.
  10.  J. Cheng, M. Grossman, and T. McKercher, Professional CUDA C Programming. Indianapolis, IN 46256: John Wiley & Sons, Inc., 2014.
  11.  J. Sanders and E. Kandrot, CUDA by Example: An Introduction to General-Purpose GPU Programming. Addison-Wesley Professional, 2010.
  12.  G. Barlas, Multicore and GPU Programming: An Integrated Approach. Morgan Kaufmann Publishers, 2015.
  13.  K. Stokfiszewski and K. Wieloch, “ Time effectiveness optimization of cross correlation methods for OCR systems with the use of graphics processing units,” J. Appl. Comput. Sci., vol. 23, no. 2, pp. 79–100, 2015.
  14.  A. Wojciechowski and T. Gałaj, “GPU supported dual quaternions based skinning,” in Computer Game Innovations. A. Wojciechowski, P. Napieralski (Eds.), Lodz University of Technology Press, 2016, pp. 5–23.
  15.  M. Wawrzonowski, D. Szajerman, M. Daszuta, and P. Napieralski, “Mobile devices’ GPUs in cloth dynamics simulation,” in Proceedings of the Federated Conference on Computer Science and Information Systems. M. Ganzha, L. Maciaszek, M. Paprzycki (Eds.), 2017, pp. 1283–1290.
  16.  D. Puchala, K. Stokfiszewski, B. Szczepaniak, and M. Yatsymirskyy, “Effectiveness of fast fourier transform implementations on GPU and CPU,” Przegla˛d Elektrotechniczny, vol. 92, no. 7, pp. 69–71, 2016.
  17.  K. Stokfiszewski, K. Wieloch, and M. Yatsymirskyy, “The fast Fourier transform partitioning scheme for GPU’s computation effectiveness improvement,” in Advances in Intelligent Systems and Computing II (CSIT), N. Shakhovska and V. Stepashko (Eds.), Springer, Cham, 2017, vol. 689, no. 1, pp. 511–522.
  18.  B.H.H. Juurlink and H.A.G. Wijshoff, “A quantitive comparison of parallel computation models,” ACM Trans. Comput. Syst., vol. 16, no. 3, pp. 271–318, 1988.
  19.  S.G. Akl, Parallel computation. Models and methods. Upple Saddle River, NJ: Prentice Hall, 1997.
  20.  A. Madougou, S. Varbanescu, C. Laat, and R. van Nieuwpoort, “The landscape of GPGPU performance modeling tools,” Parallel Comput., vol. 56, pp. 18–33, 2016.
  21.  H. Sunpyo and K. Hyesoon, “An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness,” ACM SIGARCH Comput. Architect. News, vol. 37, pp. 152–163, 2009.
  22.  C. Luo and R. Suda, “An execution time prediction analytical model for GPU with instruction-level and thread-level parallelism awareness,” IPSJ SIG Tech. Rep., vol. 2011-HPC-130, no. 19, pp. 1–9, 2011.
  23.  M. Amaris, D. Cordeiro, A. Goldman, and R.Y. de Camargo, “A simple BSP-based model to predict execution time in GPU applications,” in Proc. IEEE 22nd International Conference on High Performance Computing (HiPC), 2015, pp. 285–294.
  24.  L. Ma, R.D. Chamberlain, and K. Agrawal, “Performance modeling for highly-threaded many-core GPUs,” in Proc. IEEE 25th International Conference on Application-Specific Systems, Arch’s and Processors, 2014, pp. 84–91.
  25.  K. Kothapalli, R. Mukherjee, M.S. Rehman, S. Patidar, P.J. Narayanan, and K. Srinathan, “A performance prediction model for the CUDA GPGPU platform,” in Proc. International Conference on High Performance Computing (HiPC), 2009, pp. 463–472.
  26.  M. Amaris, R.Y. de Camargo, M. Dyab, A. Goldman, and D. Trystram, “A comparison of GPU execution time prediction using machine learning and analytical modeling,” in Proc. 15th IEEE International Symposium on Network Computing and Applications (NCA), 2016, pp. 326–333.
  27.  A. Karami, S.A. Mirsoleimani, and F. Khunjush, “A statistical performance prediction model for OpenCL kernels on NVIDIA GPUs,” in Proc. 17th CSI Int. Symposium on Computer Architecture & Digital Systems (CADS), 2013, pp. 15–22.
  28.  A. Kerr, E. Anger, G. Hendry, and S. Yalamanchili, “Eiger: A framework for the automated synthesis of statistical performance models,” in Proc. 19th Int. Conference on High Performance Computing, 2012, pp. 1–6.
  29.  Y. Zhang, Y. Hu, B. Li, and L. Peng, “Performance and power analysis of ATI GPU: A statistical approach,” in Proc. 6th IEEE International Conference on Networking, Architecture, and Storage, 2011, pp. 149–158.
  30.  G. Wu, J.L. Greathouse, A. Lyashevsky, N. Jayasena, and D. Chiou, “GPGPU performance and power estimation using machine learning,” in Proc. 21st IEEE Int. Symposium on High Performance Computer Architecture (HPCA), 2015, pp. 564– 576.
  31.  E. Ipek, B. Supinski, M. Schulz, and S. McKee, “An approach to performance prediction for parallel applications,” in Proc. 11th International Euro-Par Conference on Parallel Processing, 2005, pp. 196–205.
  32.  N. Ardalani, C. Lestourgeon, K. Sankaralingam, and X. Zhu, “Cross architecture performance prediction (XAPP) using CPU code to predict GPU performance,” in Proc. 48th Annual IEEE/ ACM International Symposium on Microarchitecture (MICRO), 2015, pp. 725–737.
  33.  “GPGPU-Sim project.” [Online]. Available: http://www.gpgpu-sim.org.
  34.  A. Bakhoda, W.L. Fung, H. Wong, and G.L. Yuan, “Analyzing CUDA workloads using a detailed GPU simulator,” in Proc. ISPASS International Symposium on Performance Analysis of Systems and Software, 2009, pp. 163–174.
  35.  “GPUSimPow – AES LPGPU Group Power Simulation Project.” [Online]. Available: https://www.aes.tu-berlin.de/menue/forschung/projekte/ gpusimpow_simulator/.
  36.  Z. Yu, L. Eeckhout, N. Goswami, T. Li, L.K. John, H. Jin, C. Xu, and J. Wu, “Accelerating GPGPU micro-architecture simulation,” IEEE Trans. Comput., vol. 64, no. 11, pp. 3153–3166, 2015.
  37.  R. Ubal, B. Jang, P. Mistry, D. Schaa, and D. Kaeli, “Multi2Sim: a simulation framework for CPU-GPU computing,” in Proc. 21st International Conf. on Parallel Architectures and Compilation Techniques (PACT), 2012, pp. 335–344.
  38.  G. Malhotra, S. Goel, and S. Sarangi, “GpuTejas: a parallel simulator for GPU architectures,” in Proc. 21st International Conference on High Performance Computing, 2014, pp. 1–10.
  39.  Y. Arafa, A.A. Badawy, G. Chennupati, N. Santhi, and S. Eidenbenz, “PPT-GPU: Scalable GPU performance modeling,” IEEE Comput. Archit. Lett., vol. 18, no. 1, pp. 55–58, 2019.
  40.  X. Wang, K. Huang, A. Knoll, and X. Qian, “A hybrid framework for fast and accurate GPU performance estimation through source-level analysis and trace-based simulation,” in Proc. IEEE International Symposium on High Performance Computer Architecture (HPCA), 2019, pp. 506–518.
  41.  K. Punniyamurthy, B. Boroujerdian, and A. Gerstlauer, “GATSim: Abstract timing simulation of GPUs,” in Proc. Design, Automation & Test, Europe Conf. & Exhibition (DATE), 2017, pp. 43–48.
  42.  M. Khairy, Z. Shen, T.M. Aamodt, and T.G. Rogers, “AccelSim: An extensible simulation framework for validated GPU modeling,” in Proc. 47th IEEE/ACM Int. Symposium on Computer Architecture (ISCA), 2020, pp. 473–486.
  43.  S. Collange, M. Daumas, D. Defour, and D. Parello, “Barra: A parallel functional simulator for GPGPU,” in Proc. IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, 2010, pp. 351–360.
  44.  “GPU Ocelot project: a dynamic compilation framework for GPU computing.” [Online]. Available: http://www.gpuocelot.gatech.edu/
  45.  J. Power, J. Hestness, M.S. Orr, M.D. Hill, and D.A. Wood, “gem5-gpu: A heterogeneous CPU-GPU simulator,” IEEE Comput. Archit. Lett., vol. 14, no. 1, pp. 34–36, 2015.
  46.  “FusionSim GPU simulator project.” [Online]. Available: https://sites.google.com/site/fusionsimulator/
  47.  A. Nakonechny and Z. Veres, “The wavelet based trained filter for image interpolation,” in Proc. IEEE 1st International Conference on Data Stream Mining & Processing, 2016, pp. 218–221.
  48.  G. Strang and T. Nguyen, Wavelets and Filter Banks. Welleslay, UK: Welleslay-Cambridge Press, 1996.
  49.  P. Lipiński and J. Stolarek, “Improving watermark resistance against removal attacks using orthogonal wavelet adaptation,” in Proc. 38th Conference on Current Trends in Theory and Practice of Computer Science, vol. 7147, 2012, pp. 588–599.
  50.  D. Bařina, M. Kula, and P. Zemčík, “Parallel wavelet schemes for images,” J. Real-Time Image Process., vol. 16, no. 5, pp. 1365–1381, 2019.
  51.  D. Bařina, M. Kula, M. Matýšek, and P. Zemčík, “Accelerating discrete wavelet transforms on GPUs,” in Proc. International Conference on Image Processing (ICIP), 2017, pp. 2707– 2710.
  52.  D. Bařina, M. Kula, M. Matýšek, and P. Zemčík, “Accelerating discrete wavelet transforms on parallel architectures,” J. WSCG, vol. 25, no. 2, pp. 77–85, 2017.
  53.  W. van der Laan, A. Jalba, and J. Roerdink, “Accelerating wavelet lifting on graphics hardware using CUDA,” IEEE Trans. Parallel Distrib. Syst., vol. 22, no. 1, pp. 132–146, 2011.
  54.  M. Yatsymirskyy, “A novel matrix model of two channel biorthogonal filter banks,” Metody Informatyki Stosowanej, pp. 205–212, 2011.
  55.  M. Yatsymirskyy and K. Stokfiszewski, “Effectiveness of lattice factorization of two-channel orthogonal filter banks,” in Proc. Joint Conference NTAV/SPA, 2012, pp. 275–279.
  56.  M. Yatsymirskyy, “Lattice structures for synthesis and implementation of wavelet transforms,” J. Appl. Comput. Sci., vol. 17, no. 1, pp. 133–141, 2009.
  57.  J. Stolarek, “Adaptive synthesis of a wavelet transform using fast neural network,” Bull. Pol. Acad. Sci. Tech. Sci., vol. 59, pp. 9– 13, 2011.
  58.  D. Puchala, K. Stokfiszewski, K. Wieloch, and M. Yatsymirskyy, “Comparative study of massively parallel GPU realizations of wavelet transform computation with lattice structure and matrixbased approach,” in Proc. IEEE International Conference on Data Stream Mining & Processing, 2018, pp. 88–93.
  59.  M. Harris, S. Sengupta, and J.D. Owens, “Parallel prefix sum (scan) with CUDA,” in GPU Gems 3, Part VI: GPU Computing, H. Nguyen, Ed. Addison Wesley, 2007, pp. 851–876.
  60.  S. Sengupta, A.E. Lefohn, and J.D. Owens, “A work-efficient step-efficient prefix sum algorithm,” in Proc. Workshop on Edge Computing Using New Commodity Architectures, 2006, pp. D–26–27.
  61.  J. Franco, G. Bernabe, J. Fernandez, and M.E. Acacio, “A parallel implementation of the 2d wavelet transform using CUDA,” in Proc. 17th Euromicro International Conference on Parallel, Distributed and Network-based Processing, 2009, pp. 111–118.
  62.  H. Bantikyan, “CUDA based implementation of 2-D discrete Haar wavelet transformation,” in Proc. International Conference Parallel and Distributed Computing Systems, 2014, pp. 20–26.
  63.  M.J. Flynn and S.F. Oberman, Advanced Computer Arithmetic Design. New York, NY, USA: John Wiley & Sons, Inc., 2001.
  64.  Ł. Napierała, “Effectiveness measurements of linear transforms realized on graphics processing units with the use of GPGPUSim emulator” – MSc thesis, Institute of Information Technology, Łódz´ University of Technology, Poland, 2020.

Date

25.02.2022

Type

Article

Identifier

DOI: 10.24425/bpasts.2021.139393
×