Revision as of 20:23, 17 February 2019

GPU (Graphics Processing Unit),
a specialized processor primarily intended for graphics cards to rapidly manipulate and alter memory for fast image processing, usually but not necessarily mapped to the framebuffer of a display. GPUs have more raw computing power than general-purpose CPUs but require a restricted, specialized and massively parallel style of programming, which does not fit well with the serial nature of alpha-beta when it comes to a massively parallel search in chess. Instead, best-first Monte-Carlo Tree Search (MCTS) approaches in conjunction with deep learning have proved a successful way to go on GPU architectures.

GPGPU

There are various frameworks for GPGPU, General Purpose computing on Graphics Processing Units. Apart from language wrappers and mobile devices with special APIs, there are three main ways to make use of GPGPU.

Mapping to an API

BrookGPU (translates to OpenGL and DirectX)
C++ AMP (Open standard by Microsoft that extends C++)
DirectCompute (GPGPU API by Microsoft)

Native Compilers

CUDA (GPGPU framework by Nvidia)
OpenCL (Open Computing Language specified by Khronos Group)

Intermediate Languages

Inside

Modern GPUs consist of up to hundreds of SIMD or vector units, organized into compute units. Each compute unit processes multiple Warps (Nvidia term) or Wavefronts (AMD term) in SIMT fashion. Each Warp or Wavefront runs n (32 or 64) threads simultaneously.

The Nvidia GeForce GTX 580, for example, runs 32 threads per Warp, up to 24576 threads in total, spread over 16 compute units with a total of 512 cores. ^[2] The AMD Radeon HD 7970 runs 64 threads per Wavefront, up to 81920 threads in total, spread over 32 compute units with a total of 2048 cores. ^[3] In practice, register and shared memory sizes limit the total number of threads.
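The per-compute-unit figures implied by these totals can be derived with a little arithmetic. The following is a minimal sketch in Python; the class and method names are hypothetical and chosen only for illustration, while the numbers are the ones quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuSpec:
    """Thread and core totals as quoted in the text (names are illustrative)."""
    name: str
    group_width: int      # threads per Warp (Nvidia) or Wavefront (AMD)
    max_threads: int      # total threads the device can keep resident
    compute_units: int
    cores: int

    def threads_per_cu(self) -> int:
        return self.max_threads // self.compute_units

    def groups_per_cu(self) -> int:
        # Number of Warps or Wavefronts resident on one compute unit
        return self.threads_per_cu() // self.group_width

    def cores_per_cu(self) -> int:
        return self.cores // self.compute_units


gtx580 = GpuSpec("GeForce GTX 580", 32, 24576, 16, 512)
hd7970 = GpuSpec("Radeon HD 7970", 64, 81920, 32, 2048)

for gpu in (gtx580, hd7970):
    print(gpu.name, gpu.threads_per_cu(), gpu.groups_per_cu(), gpu.cores_per_cu())
```

Under these figures each compute unit keeps far more threads resident than it has cores, which is how a GPU hides memory latency by switching between Warps or Wavefronts.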

Memory

The memory hierarchy of a GPU consists mainly of private memory (registers accessed by a single thread or work-item), local memory (shared by the threads of a block or the work-items of a work-group), constant memory, different types of cache, and global memory. Size, latency and bandwidth vary between vendors and architectures.

Data for the Nvidia GeForce GTX 580 (Fermi) as an example: ^[4]

128 KiB private memory per compute unit
48 KiB (16 KiB) local memory per compute unit (configurable)
64 KiB constant memory
8 KiB constant cache per compute unit
16 KiB (48 KiB) L1 cache per compute unit (configurable)
768 KiB L2 cache
1.5 GiB to 3 GiB global memory

Data for the AMD Radeon HD 7970 (GCN) as an example: ^[5]

256 KiB private memory per compute unit
64 KiB local memory per compute unit
64 KiB constant memory
16 KiB constant cache per four compute units
16 KiB L1 cache per compute unit
768 KiB L2 cache
3 GiB to 6 GiB global memory
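To illustrate how private memory bounds the number of resident threads, here is a rough sketch in Python. It assumes 32 bit registers, takes the per-compute-unit thread limits of 1536 and 2560 from dividing the thread totals quoted earlier by the number of compute units, and ignores allocation granularity and local memory limits that real hardware also applies. The function name and parameters are hypothetical.

```python
def register_limited_threads(private_kib: int, regs_per_thread: int,
                             hw_max_threads: int, group_width: int) -> int:
    """Estimate resident threads per compute unit when registers are the limit.

    Assumption: every register is 32 bit (4 bytes) wide and threads are
    scheduled in whole Warps or Wavefronts of group_width threads.
    """
    total_regs = private_kib * 1024 // 4
    threads = min(total_regs // regs_per_thread, hw_max_threads)
    # Round down to a whole number of Warps or Wavefronts
    return threads - threads % group_width


# GTX 580: 128 KiB private memory, 1536 threads, Warp width 32
print(register_limited_threads(128, 63, 1536, 32))
# HD 7970: 256 KiB private memory, 2560 threads, Wavefront width 64
print(register_limited_threads(256, 64, 2560, 64))
```

A kernel that needs many registers per thread therefore runs with only a fraction of the theoretical thread count.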

Instruction Throughput

GPUs are used in HPC environments because of their good FLOP/Watt ratio. The instruction throughput in general depends on the architecture (like Nvidia's Tesla, Fermi, Kepler, Maxwell or AMD's Terascale, GCN), the brand (like Nvidia GeForce, Quadro, Tesla or AMD Radeon, Radeon Pro, Radeon Instinct) and the specific model.

32 bit Integer Performance

Depending on the architecture, 32 bit integer performance can be lower than 32 bit floating-point or 24 bit integer performance.

64 bit Integer Performance

Current GPU registers and Vector-ALUs are 32 bit wide and have to emulate 64 bit integer operations.^[6] ^[7]
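What such an emulation involves can be sketched in Python by restricting every intermediate value to 32 bits. This is an illustration of the general technique only, not the instruction sequence any particular GPU compiler emits; all function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def split64(x: int) -> tuple[int, int]:
    """Split a 64 bit value into (low, high) 32 bit halves."""
    return x & MASK32, (x >> 32) & MASK32


def join64(lo: int, hi: int) -> int:
    return (hi << 32) | lo


def add64(a_lo: int, a_hi: int, b_lo: int, b_hi: int) -> tuple[int, int]:
    """64 bit addition built from two 32 bit additions plus a carry."""
    lo = (a_lo + b_lo) & MASK32
    carry = 1 if lo < a_lo else 0          # low half wrapped around
    hi = (a_hi + b_hi + carry) & MASK32
    return lo, hi


def shl64(lo: int, hi: int, n: int) -> tuple[int, int]:
    """64 bit left shift by n (0..63) using 32 bit shifts only."""
    if n == 0:
        return lo, hi
    if n >= 32:
        return 0, (lo << (n - 32)) & MASK32
    return (lo << n) & MASK32, ((hi << n) | (lo >> (32 - n))) & MASK32


a, b = 0x00000001FFFFFFFF, 1
print(hex(join64(*add64(*split64(a), *split64(b)))))
```

Every 64 bit operation costs several 32 bit instructions this way, which matters for bitboard-based chess move generation on a GPU.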

Mixed Precision Support

Newer architectures like Nvidia Turing and AMD Vega have mixed precision support. Vega doubles the FP16 and quadruples the INT8 throughput.^[8] Turing doubles the FP16 throughput of its FPUs.^[9]

TensorCores

With the Nvidia Volta series, TensorCores were introduced. They offer fp16*fp16+fp32 matrix-multiply-accumulate units, used to accelerate neural networks.^[10] Turing's 2nd generation TensorCores add FP16, INT8 and INT4 optimized computation.^[11]
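The fp16*fp16+fp32 operation can be mimicked in plain Python to show what mixed precision means numerically. This is only a sketch: it uses the standard library's `struct` module to round values to half and single precision, the helper names are hypothetical, and the summation order inside real TensorCores may differ.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE single precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def mma(a, b, c):
    """D = A*B + C with FP16 inputs A, B and an FP32 accumulator C."""
    rows, inner, cols = len(a), len(b), len(b[0])
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = to_fp32(c[i][j])
            for t in range(inner):
                # Inputs are rounded to FP16, the running sum is kept in FP32
                acc = to_fp32(acc + to_fp16(a[i][t]) * to_fp16(b[t][j]))
            d[i][j] = acc
    return d


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[0.5, 0.5], [0.5, 0.5]]
print(mma(A, B, C))
```

Keeping the accumulator in FP32 limits the rounding error that would build up if the long sums of a neural network layer were carried in FP16.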

Throughput Examples

Nvidia GeForce GTX 580 (Fermi, CC 2.0) - 32 bit integer operations/clock cycle per compute unit ^[12]

   MAD 16
   MUL 16
   ADD 32
   Bit-shift 16
   Bitwise XOR 32

Max theoretical ADD operation throughput: 32 Ops * 16 CUs * 1544 MHz = 790.528 GigaOps/sec

AMD Radeon HD 7970 (GCN 1.0) - 32 bit integer operations/clock cycle per processing element ^[13]

   MAD 1/4
   MUL 1/4
   ADD 1
   Bit-shift 1
   Bitwise XOR 1

Max theoretical ADD operation throughput: 1 Op * 2048 PEs * 925 MHz = 1894.4 GigaOps/sec
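The two peak figures follow the same formula: operations per clock per unit, times number of units, times clock rate. A small Python sketch reproduces them and applies the same formula to the MAD rows of both tables; the function name is hypothetical, and exact fractions are used so that the 1/4 rate does not introduce rounding.

```python
from fractions import Fraction


def peak_mops(ops_per_clock, units: int, clock_mhz: int) -> Fraction:
    """Theoretical peak in millions of operations per second."""
    return Fraction(ops_per_clock) * units * clock_mhz


# GTX 580: rates are per compute unit, 16 CUs at 1544 MHz
gtx580_add = peak_mops(32, 16, 1544)
gtx580_mad = peak_mops(16, 16, 1544)

# HD 7970: rates are per processing element, 2048 PEs at 925 MHz
hd7970_add = peak_mops(1, 2048, 925)
hd7970_mad = peak_mops(Fraction(1, 4), 2048, 925)

for label, mops in [("GTX 580 ADD", gtx580_add), ("GTX 580 MAD", gtx580_mad),
                    ("HD 7970 ADD", hd7970_add), ("HD 7970 MAD", hd7970_mad)]:
    print(f"{label}: {float(mops) / 1000} GigaOps/sec")
```

These are upper bounds only; real kernels rarely sustain them because of memory access and thread limits.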

Deep Learning

GPUs are much better suited than CPUs to implementing and training Convolutional Neural Networks (CNN), and were therefore also responsible for the deep learning boom. This also affected game playing programs that combine CNNs with MCTS, as pioneered by Google DeepMind's AlphaGo and AlphaZero in Go, Shogi and Chess using TPUs, and by the open source projects Leela Zero, headed by Gian-Carlo Pascutto, for Go and its adaptation Leela Chess Zero.

Publications

2009

Ren Wu, Bin Zhang, Meichun Hsu (2009). Clustering billions of data points using GPUs. ACM International Conference on Computing Frontiers
Mark Govett, Craig Tierney, Jacques Middlecoff, Tom Henderson (2009). Using Graphical Processing Units (GPUs) for Next Generation Weather and Climate Prediction Models. CAS2K9 Workshop, pdf

2010 ...

Avi Bleiweiss (2010). Playing Zero-Sum Games on the GPU. NVIDIA Corporation, GPU Technology Conference 2010, slides as pdf
Mark Govett, Jacques Middlecoff, Tom Henderson (2010). Running the NIM Next-Generation Weather Model on GPUs. CCGRID 2010
Mark Govett, Jacques Middlecoff, Tom Henderson, Jim Rosinski, Craig Tierney (2011). Parallelization of the NIM Dynamical Core for GPUs. slides as pdf
Ľubomír Lackovič (2011). Parallel Game Tree Search Using GPU. Institute of Informatics and Software Engineering, Faculty of Informatics and Information Technologies, Slovak University of Technology in Bratislava, pdf
Dan Anthony Feliciano Alcantara (2011). Efficient Hash Tables on the GPU. Ph. D. thesis, University of California, Davis, pdf » Hash Table
Damian Sulewski (2011). Large-Scale Parallel State Space Search Utilizing Graphics Processing Units and Solid State Disks. Ph.D. thesis, University of Dortmund, pdf
Damjan Strnad, Nikola Guid (2011). Parallel Alpha-Beta Algorithm on the GPU. CIT. Journal of Computing and Information Technology, Vol. 19, No. 4 » Parallel Search, Reversi
Liang Li, Hong Liu, Peiyu Liu, Taoying Liu, Wei Li, Hao Wang (2012). A Node-based Parallel Game Tree Algorithm Using GPUs. CLUSTER 2012 » Parallel Search
S. Ali Mirsoleimani, Ali Karami, Farshad Khunjush (2013). A parallel memetic algorithm on GPU to solve the task scheduling problem in heterogeneous environments. GECCO '13
Diego Rodríguez-Losada, Pablo San Segundo, Miguel Hernando, Paloma de la Puente, Alberto Valero-Gomez (2013). GPU-Mapping: Robotic Map Building with Graphical Multiprocessors. IEEE Robotics & Automation Magazine, Vol. 20, No. 2, pdf
Qingqing Dang, Shengen Yan, Ren Wu (2014). A fast integral image generation algorithm on GPUs. ICPADS 2014

2015 ...

Peter H. Jin, Kurt Keutzer (2015). Convolutional Monte Carlo Rollouts in Go. arXiv:1512.03375 » Deep Learning, Go, MCTS
Liang Li, Hong Liu, Hao Wang, Taoying Liu, Wei Li (2015). A Parallel Algorithm for Game Tree Search Using GPGPU. IEEE Transactions on Parallel and Distributed Systems, Vol. 26, No. 8 » Parallel Search
Sean Sheen (2016). Astro - A Low-Cost, Low-Power Cluster for CPU-GPU Hybrid Computing using the Jetson TK1. Master's thesis, California Polytechnic State University, pdf ^[14] ^[15]
Jingyue Wu, Artem Belevich, Eli Bendersky, Mark Heffernan, Chris Leary, Jacques Pienaar, Bjarke Roune, Rob Springer, Xuetian Weng, Robert Hundt (2016). gpucc: an open-source GPGPU compiler. CGO 2016
David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, Demis Hassabis (2016). Mastering the game of Go with deep neural networks and tree search. Nature, Vol. 529 » AlphaGo
David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, Demis Hassabis (2017). Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm. arXiv:1712.01815 » AlphaZero
Tristan Cazenave (2017). Residual Networks for Computer Go. IEEE Transactions on Computational Intelligence and AI in Games, Vol. PP, No. 99, pdf
David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, Demis Hassabis (2018). A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, Vol. 362, No. 6419

Forum Posts

2005 ...

Hardware assist by Nicolai Czempin, Winboard Forum, August 27, 2006
Monte carlo on a NVIDIA GPU ? by Marco Costalba, CCC, August 01, 2008

2010 ...

Using the GPU by Louis Zulli, CCC, February 19, 2010
GPGPU and computer chess by Wim Sjoho, CCC, February 09, 2011
Possible Board Presentation and Move Generation for GPUs? by Srdja Matovic, CCC, March 19, 2011

Re: Possible Board Presentation and Move Generation for GPUs by Steffan Westcott, CCC, March 20, 2011

Zeta plays chess on a gpu by Srdja Matovic, CCC, June 23, 2011 » Zeta
GPU Search Methods by Joshua Haglund, CCC, July 04, 2011
Possible Search Algorithms for GPUs? by Srdja Matovic, CCC, January 07, 2012 ^[16] ^[17]
uct on gpu by Daniel Shawul, CCC, February 24, 2012 » UCT
Is there such a thing as branchless move generation? by John Hamlen, CCC, June 07, 2012 » Move Generation
Choosing a GPU platform: AMD and Nvidia by John Hamlen, CCC, June 10, 2012
Nvidias K20 with Recursion by Srdja Matovic, CCC, December 04, 2012 ^[18]
Kogge Stone, Vector Based by Srdja Matovic, CCC, January 22, 2013 » Kogge-Stone Algorithm ^[19] ^[20]
GPU chess engine by Samuel Siltanen, CCC, February 27, 2013
Fast perft on GPU (upto 20 Billion nps w/o hashing) by Ankan Banerjee, CCC, June 22, 2013 » Perft, Kogge-Stone Algorithm ^[21]

2015 ...

GPU chess update, local memory... by Srdja Matovic, CCC, June 06, 2016
Jetson GPU architecture by Dann Corbit, CCC, October 18, 2016 » Astro
Pigeon is now running on the GPU by Stuart Riffle, CCC, November 02, 2016 » Pigeon
Back to the basics, generating moves on gpu in parallel... by Srdja Matovic, CCC, March 05, 2017 » Move Generation
Re: Perft(15): comparison of estimates with Ankan's result by Ankan Banerjee, CCC, August 26, 2017 » Perft(15)
Chess Engine and GPU by Fishpov , Rybka Forum, October 09, 2017
To TPU or not to TPU... by Srdja Matovic, CCC, December 16, 2017 » Deep Learning ^[22]
Announcing lczero by Gary, CCC, January 09, 2018 » Leela Chess Zero
GPU ANN, how to deal with host-device latencies? by Srdja Matovic, CCC, May 06, 2018 » Neural Networks
My non-OC RTX 2070 is very fast with Lc0 by Kai Laskos, CCC, November 19, 2018 » Leela Chess Zero
LC0 using 4 x 2080 Ti GPU's on Chess.com tourney? by M. Ansari, CCC, December 28, 2018 » Leela Chess Zero
Generate EGTB with graphics cards? by Nguyen Pham, CCC, January 01, 2019 » Endgame Tablebases
LCZero FAQ is missing one important fact by Jouni Uski, CCC, January 01, 2019 » Leela Chess Zero

External Links

OpenCL

CUDA

CUDA from Wikipedia
CUDA Zone | NVIDIA Developer
Nvidia CUDA Compiler (NVCC) from Wikipedia
Compiling CUDA with clang — LLVM Clang documentation
CppCon 2016: “Bringing Clang and C++ to GPUs: An Open-Source, CUDA-Compatible GPU C++ Compiler" by Justin Lebar, YouTube Video ^[23]


Deep Learning

Deep Learning | NVIDIA Developer » Deep Learning
NVIDIA cuDNN | NVIDIA Developer
Efficient mapping of the training of Convolutional Neural Networks to a CUDA-based cluster
Deep Learning in a Nutshell: Core Concepts by Tim Dettmers, Parallel Forall, November 3, 2015
Deep Learning in a Nutshell: History and Training by Tim Dettmers, Parallel Forall, December 16, 2015
Deep Learning in a Nutshell: Sequence Learning by Tim Dettmers, Parallel Forall, March 7, 2016
Deep Learning in a Nutshell: Reinforcement Learning by Tim Dettmers, Parallel Forall, September 8, 2016
Faster deep learning with GPUs and Theano
Theano (software) from Wikipedia
TensorFlow from Wikipedia

Game Programming

GitHub - gcp/leela-zero: Go engine with no human-provided knowledge, modeled after the AlphaGo Zero paper

Chess Programming

References

[1] Graphics processing unit - Wikimedia Commons
[2] CUDA C Programming Guide v7.0, Appendix G. COMPUTE CAPABILITIES, Table 12 Technical Specifications per Compute Capability
[3] AMD Accelerated Parallel Processing OpenCL Programming Guide rev2.7, Appendix D Device Parameters, Table D.1 Parameters for 7xxx Devices
[4] CUDA C Programming Guide v7.0, Appendix G. COMPUTE CAPABILITIES
[5] AMD Accelerated Parallel Processing OpenCL Programming Guide rev2.7, Appendix D Device Parameters, Table D.1 Parameters for 7xxx Devices
[6] AMD Vega White Paper
[7] Nvidia Turing White Paper
[8] Vega (GCN 5th generation) from Wikipedia
[9] AnandTech - Nvidia Turing Deep Dive page 4
[10] INSIDE VOLTA
[11] AnandTech - Nvidia Turing Deep Dive page 6
[12] CUDA C Programming Guide v7.0, Chapter 5.4.1. Arithmetic Instructions
[13] AMD_OpenCL_Programming_Optimization_Guide.pdf 3.0beta, Chapter 2.7.1 Instruction Bandwidths
[14] Jetson TK1 Embedded Development Kit | NVIDIA
[15] Jetson GPU architecture by Dann Corbit, CCC, October 18, 2016
[16] Yaron Shoham, Sivan Toledo (2002). Parallel Randomized Best-First Minimax Search. Artificial Intelligence, Vol. 137, Nos. 1-2
[17] Alberto Maria Segre, Sean Forman, Giovanni Resta, Andrew Wildenberg (2002). Nagging: A Scalable Fault-Tolerant Paradigm for Distributed Search. Artificial Intelligence, Vol. 140, Nos. 1-2
[18] Tesla K20 GPU Compute Processor Specifications Released | techPowerUp
[19] Parallel Thread Execution from Wikipedia
[20] NVIDIA Compute PTX: Parallel Thread Execution, ISA Version 1.4, March 31, 2009, pdf
[21] ankan-ban/perft_gpu · GitHub
[22] Tensor processing unit from Wikipedia
[23] Re: Generate EGTB with graphics cards? by Graham Jones, CCC, January 01, 2019
[24] Fast perft on GPU (upto 20 Billion nps w/o hashing) by Ankan Banerjee, CCC, June 22, 2013


