


[[FILE:NvidiaTesla.jpg|border|right|thumb| [ Nvidia Tesla] GPU <ref>[ Image] by Mahogny, February 09, 2008, [ Wikimedia Commons]</ref> ]]
'''GPU''' (Graphics Processing Unit),<br/>
a specialized processor initially intended for fast [ image processing]. GPUs may have more raw computing power than general purpose [ CPUs] but need a specialized and massively parallel way of programming. [[Leela Chess Zero]] has proven that a [[Best-First|Best-first]] [[Monte-Carlo Tree Search|Monte-Carlo Tree Search]] (MCTS) with [[Deep Learning|deep learning]] methodology will work with GPU architectures.
=History=
In the 1970s and 1980s RAM was expensive, and home computers used custom graphics chips that operated directly on registers and memory without a dedicated frame buffer or texture buffer, like [ TIA] in the [[Atari 8-bit|Atari VCS]] gaming system, [ GTIA]+[ ANTIC] in the [[Atari 8-bit|Atari 400/800]] series, or [ Denise]+[ Agnus] in the [[Amiga|Commodore Amiga]] series. The 1990s made 3D graphics and 3D modeling more popular, especially for video games. Cards specifically designed to accelerate 3D math emerged, such as [ SGI Impact] (1995) in 3D graphics workstations or [ 3dfx Voodoo] (1996) for playing 3D games on PCs. Some game engines could instead use the [[SIMD and SWAR Techniques|SIMD-capabilities]] of CPUs, such as the [[Intel]] [[MMX]] instruction set or [[AMD|AMD's]] [[X86#3DNow!|3DNow!]], for [ real-time rendering]. Sony's 3D capable chip [ GTE] used in the PlayStation (1994) and Nvidia's 2D/3D combi chips like [ NV1] (1995) coined the term GPU for 3D graphics hardware acceleration. With the advent of the [ unified shader architecture], like in Nvidia [ Tesla] (2006), ATI/AMD [ TeraScale] (2007) or Intel [ GMA X3000] (2006), GPGPU frameworks like [ CUDA] and [[OpenCL|OpenCL]] emerged and gained in popularity.

The traditional job of a GPU is to take the [ x,y,z coordinates] of [ triangles] and [ map] these triangles to [ screen space] through a [ matrix multiplication]. As video game graphics grew more sophisticated, the number of triangles per scene grew larger. GPUs similarly grew into massively parallel behemoths capable of performing billions of transformations hundreds of times per second.

These lists of triangles were specified in graphics APIs like [ DirectX]. But video game programmers demanded more flexibility from their hardware, such as lighting, transparency, and reflections. This flexibility was granted with specialized programming languages, called [ vertex shaders] or [ pixel shaders].

Eventually, the fixed functionality of GPUs disappeared, and GPUs became primarily massively parallel general purpose computers. Instead of using vertex shaders inside of [ DirectX], general compute languages are designed to make sense outside of a graphical setting.

=GPU in Computer Chess=

There are four main ways to use a GPU for chess:

* As an accelerator in [[Leela_Chess_Zero|Lc0]]: run a neural network for position evaluation on GPU
* Offload the search in [[Zeta|Zeta]]: run a parallel game tree search with move generation and position evaluation on GPU
* As a hybrid in [http://www.talkchess.com/forum3/viewtopic.php?t=64983&start=4#p729152 perft_gpu]: expand the game tree to a certain degree on CPU and offload to GPU to compute the sub-tree
* Neural network training such as [https://github.com/glinscott/nnue-pytorch Stockfish NNUE trainer in Pytorch] <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=7&t=75724 Pytorch NNUE training] by [[Gary Linscott]], [[CCC]], November 08, 2020</ref> or [https://github.com/LeelaChessZero/lczero-training Lc0 TensorFlow Training]

=GPU Chess Engines=
* [[:Category:GPU]]

=GPGPU=

Early efforts to leverage a GPU for general-purpose computing required reformulating computational problems in terms of graphics primitives via graphics APIs like [ OpenGL] or [ DirectX], followed by first GPGPU frameworks such as [ Sh/RapidMind] or [ Brook] and finally [ CUDA] and [ OpenCL].
== Khronos OpenCL ==
[[OpenCL|OpenCL]], specified by the [ Khronos Group], is widely adopted across all kinds of hardware accelerators from different vendors.

The [ Khronos group] is a committee formed to oversee the [ OpenGL], [[OpenCL]], and [ Vulkan] standards. Although compute shaders exist in all of these languages, OpenCL is the designated general purpose compute language. OpenCL 1.2 is widely supported by [[AMD]], [[Nvidia|NVidia]], and [[Intel]]. OpenCL 2.0, although specified in 2013, has had a slow rollout, and its specific features are not necessarily widespread in modern GPUs yet. AMD continues to target OpenCL 2.0 support in their ROCm environment, while NVidia has implemented some OpenCL 2.0 features.

* [ List of OpenCL Conformant Products]
* [ OpenCL 1.2 Specification]
* [ OpenCL 2.0 Reference]
* [ OpenCL 3.0 Specifications]

== AMD ==

[[AMD]] supports language frontends like OpenCL, HIP and C++ AMP, as well as OpenMP offload directives. With [ ROCm] it offers its own parallel compute platform.

[[AMD|AMD's]] original software stack, called [ AMDGPU-pro], provides OpenCL 1.2 and 2.0 capabilities on [[Linux]] and [[Windows]]. However, most of AMD's effort today goes into an experimental framework called [ ROCm]. ROCm is AMD's open source compiler and device driver stack intended for general purpose compute. ROCm supports two languages: [ HIP] (a CUDA-like single-source C++ compiler also based on LLVM/clang) and OpenCL 2.0. ROCm only works on Linux machines with modern hardware, such as [ PCIe 3.0] and relatively recent GPUs (such as the [ RX 580] and [ Vega] GPUs).

AMD regularly publishes the assembly language details of their architectures. Their "GCN Assembly" changes slightly from generation to generation, but the fundamental principles have remained the same.

AMD's OpenCL documentation, especially the "OpenCL Programming Guide" and the "Optimization Guide", is a good place to start for beginners looking to program their GPUs. For Linux developers, the ROCm environment is under active development and has enough features to get code working well.

* [https://community.amd.com/t5/opencl/bd-p/opencl-discussions AMD OpenCL Developer Community]
* [ AMD ROCm™ documentation]
* [ ROCm Homepage]
* [ AMD OpenCL Programming Guide]
* [ AMD OpenCL Optimization Guide]
* [ AMD GPU ISA documentation]

== Apple ==

Since macOS 10.14 Mojave a transition from OpenCL to [ Metal] is recommended by [[Apple]].

* [ Apple OpenCL Developer]
* [ Apple Metal Developer]
* [ Apple Metal Programming Guide]
* [ Metal Shading Language Specification]

== Intel ==

Intel supports OpenCL with implementations like BEIGNET and NEO for different GPU architectures, and the [ oneAPI] platform with [ DPC++] as frontend language.

* [https://www.intel.com/content/www/us/en/developer/overview.html#gs.pu62bi Intel Developer Zone]
* [ Intel oneAPI Programming Guide]

== Nvidia ==

[https://en.wikipedia.org/wiki/CUDA CUDA] is the parallel computing platform by [[Nvidia]]. It supports language frontends like C, C++, Fortran, OpenCL and offload directives via [ OpenACC] and [ OpenMP].

CUDA has a [[Cpp|C++]] compiler based on [ LLVM] / [ clang], which compiles into an assembly-like language called [ PTX]. Nvidia device drivers take PTX and compile it down to the final machine code (called Nvidia SASS). Nvidia keeps PTX portable between its GPUs, while its SASS assembly language may change from year to year as Nvidia releases new GPUs. A defining feature of CUDA was the "single source" C++ compiler: the same compiler works with both CPU host code and GPU device code. This means that data structures and even pointers from the CPU can be shared directly with the GPU code.

* [ Nvidia CUDA Zone]
* [ Nvidia PTX ISA]
* [ Nvidia CUDA Toolkit Documentation]
* [https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html Nvidia CUDA C++ Programming Guide]
* [https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html Nvidia CUDA C++ Best Practices Guide]

== Further ==

* [ Vulkan] (OpenGL successor of Khronos Group)
* [https://en.wikipedia.org/wiki/DirectCompute DirectCompute] (Microsoft)
* [ C++ AMP] (Microsoft)
* [https://en.wikipedia.org/wiki/OpenACC OpenACC] (offload directives)
* [https://en.wikipedia.org/wiki/OpenMP OpenMP] (offload directives)

=The Implicitly Parallel SIMT Programming Model=

CUDA, OpenCL and ROCm HIP all share the same model of implicitly parallel programming. All threads are given an identifier: a threadIdx in CUDA or a local_id in OpenCL. Aside from this index, all threads of a kernel execute the same code. The only way to alter the behavior of the code is to use this threadIdx to access different data.

The executed code is always implicitly [[SIMD]]. Instead of thinking of SIMD lanes, each lane is considered its own thread. The smallest group of threads is called a Warp (Nvidia) or a Wavefront (AMD). Nvidia GPUs execute 32 threads per warp, while AMD GCN GPUs execute 64 threads per wavefront. All threads within a warp or wavefront share an instruction pointer. Consider the following CUDA code:

<pre>
if (threadIdx.x == 0) {
    doA();  // only thread #0 keeps its results
} else {
    doB();  // threads #1 through #31 keep their results
}
</pre>

While there is only one thread in the warp that has threadIdx == 0, all 32 threads of the warp will have their shared instruction pointer execute doA() together. To keep the code semantically correct, threads #1 through #31 are masked off (their Nvidia predicate or AMD execution mask is cleared), which means these threads throw away the work after executing a specific statement. For those familiar with x64 AVX code, a GPU thread is comparable to a SIMD lane in AVX. All lanes of an AVX instruction will execute any particular instruction, but you may throw away the results of some lanes using mask or comparison instructions.

Once doA() is complete, the machine will continue with doB(). In this case, thread #0 will have its execution mask cleared, while threads #1 through #31 will actually complete the results of doB().

This highlights the fundamental trade-off of the GPU platform. GPUs have many threads of execution, but they are forced to execute together with their warps or wavefronts. In complicated loops or trees of if-statements, this thread divergence problem can leave many hardware threads idle.

== Building up to larger thread groups ==

The GPU hardware executes entire warps or wavefronts at a time. Anything less than 32 threads will force some SIMD lanes to idle. As such, high-performance programmers should try to schedule as many full warps or wavefronts as possible.

Programmers can group warps or wavefronts together into larger clusters, called CUDA Blocks or OpenCL Workgroups. 1024 threads can work together on a modern GPU Compute Unit (AMD) or Streaming Multiprocessor (Nvidia), sharing L1 cache, shared memory and other resources. Because of the tight coupling of L1 cache and shared memory, these 1024 threads can communicate extremely efficiently. Case in point: both Nvidia PTX and AMD GCN implement thread barriers as a single assembly language instruction, as long as those threads are within the same workgroup. Atomic operations, memory fences, and other synchronization primitives are extremely fast and well optimized in these cases.

Workgroups are not the end of scaling, however. GPUs can execute many workgroups in parallel. An AMD Vega Compute Unit (CU) can schedule 40 wavefronts (although it only physically executes 4 wavefronts concurrently), and a Vega64 GPU has 64 CUs. AMD Vega64 (Vega) summary: 64 threads per wavefront, 1 to 16 wavefronts per workgroup. With 64 CUs supporting 40 wavefronts each, a total of 2560 wavefronts (163,840 threads) can be loaded per AMD Vega64.

Nvidia has a similar language and mechanism. Nvidia GPUs can execute many blocks in parallel. An Nvidia Streaming Multiprocessor (SM) can schedule 32 warps (although only a subset of them physically executes at any one time), and an RTX 2070 Super has 40 SMs. Nvidia RTX 2070 Super (Turing) summary: 32 threads per warp, 1 to 32 warps per block. With 40 SMs, each supporting 32 warps, a total of 1280 warps (40,960 threads) can be scheduled per RTX 2070 Super.

The challenge of GPU compute languages is to give the programmer the flexibility to take advantage of memory optimizations at the CUDA Block or OpenCL Workgroup level (~1024 threads), while still being able to specify the tens of thousands of physical threads working on a typical GPU.

=Hardware Model=

A common scheme on GPUs with unified shader architecture is to run multiple threads in [ SIMT] fashion and a multitude of SIMT waves on the same [ SIMD] unit to hide memory latencies. Multiple processing elements (GPU cores) are members of a SIMD unit, and multiple SIMD units are coupled to a compute unit, with up to hundreds of compute units present on a discrete GPU. Depending on the architecture, the actual SIMD units may have different numbers of cores (SIMD8, SIMD16, SIMD32) and different computation abilities: floating-point and/or integer, with a specific bit-width of the FPU/ALU and registers. There is a difference between a vector processor with variable bit-width and SIMD units with fixed bit-width cores. Architecture white papers from different vendors leave room for speculation about the concrete underlying hardware implementation and the concrete classification as [ hardware architecture]. Scalar units present in the compute unit perform special functions the SIMD units are not capable of, and MMAC units (matrix-multiply-accumulate units) are used to speed up neural networks further.
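SIMT execution of a divergent branch can be sketched on the CPU. The following is a minimal Python model, not vendor code: it assumes one shared instruction pointer per warp and a per-lane execution mask, and the names `run_warp`, `do_a` and `do_b` are hypothetical.

```python
WARP_SIZE = 32  # Nvidia warp width; AMD GCN wavefronts use 64 lanes


def run_warp(thread_ids, branch_taken, do_a, do_b):
    """Model an if/else executed by one warp with a shared instruction pointer.

    Each side of the branch is issued once for the whole warp if at least
    one lane needs it; the execution mask decides which lanes keep results.
    Returns the per-lane results and the number of branch sides issued.
    """
    results = [None] * len(thread_ids)
    mask = [branch_taken(t) for t in thread_ids]
    issued = 0
    # 'if' side: lanes with a cleared mask bit idle during this pass.
    if any(mask):
        issued += 1
        for lane, t in enumerate(thread_ids):
            if mask[lane]:
                results[lane] = do_a(t)
    # 'else' side: same pass structure with the inverted mask.
    if not all(mask):
        issued += 1
        for lane, t in enumerate(thread_ids):
            if not mask[lane]:
                results[lane] = do_b(t)
    return results, issued


warp = list(range(WARP_SIZE))
# Divergent: only thread 0 takes the 'if' side, so both sides are issued.
divergent, passes_divergent = run_warp(warp, lambda t: t == 0,
                                       lambda t: "A", lambda t: "B")
# Uniform: every thread takes the 'if' side, so only one side is issued.
uniform, passes_uniform = run_warp(warp, lambda t: True,
                                   lambda t: "A", lambda t: "B")
```

In this model the divergent warp pays for both sides of the branch while the uniform warp pays for one, which is the cost that thread divergence adds on real hardware.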
{| class="wikitable" style="margin:auto"
|+ Vendor Terminology
|-
! AMD Terminology !! Nvidia Terminology
|-
| Compute Unit || Streaming Multiprocessor
|-
| Stream Core || CUDA Core
|-
| Wavefront || Warp
|}

GPUs use high-bandwidth RAM, such as GDDR6 or HBM2. These specialized RAM types are designed for the extremely parallel nature of GPUs and can provide 200 GB/s to 1000 GB/s of throughput. In comparison, a typical DDR4 channel provides about 20 GB/s, and a dual-channel desktop will typically have under 50 GB/s of bandwidth to DDR4 main memory.

===Hardware Examples===

Nvidia GeForce GTX 580 ([ Fermi]) <ref>[ Fermi white paper from Nvidia]</ref><ref>[ GeForce 500 series on Wikipedia]</ref>

* 512 CUDA cores @1.544GHz
* 16 SMs - Streaming Multiprocessors
* organized in 2x16 CUDA cores per SM
* Warp size of 32 threads

AMD Radeon HD 7970 ([ GCN]) <ref>[ Graphics Core Next on Wikipedia]</ref><ref>[ Radeon HD 7000 series on Wikipedia]</ref>

* 2048 Stream cores @0.925GHz
* 32 Compute Units
* organized in 4xSIMD16, each SIMT4, per Compute Unit
* Wavefront size of 64 work-items

===Wavefront and Warp===
In general terms, the wavefront or warp size is the number of threads executed together in SIMT fashion on a GPU with unified shader architecture.

=Programming Model=

A [ parallel programming model] for GPGPU can be [ data-parallel], [ task-parallel], a mixture of both, or, with libraries and offload directives, also [ implicitly-parallel]. Single GPU threads (work-items in OpenCL) contain the kernel to be computed and are coupled to a work-group; one or multiple work-groups form the NDRange to be executed on the GPU device. The members of a work-group execute the same kernel, can usually be synchronized and have access to the same scratch-pad memory. The architecture limits how many work-items a work-group can hold and how many threads can run concurrently on the device in total.

{| class="wikitable" style="margin:auto"
|+ Terminology
|-
! OpenCL Terminology !! CUDA Terminology
|-
| Kernel || Kernel
|-
| Compute Unit || Streaming Multiprocessor
|-
| Processing Element || CUDA Core
|-
| Work-Item || Thread
|-
| Work-Group || Block
|-
| NDRange || Grid
|}
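The relation between work-items, work-groups and the NDRange can be sketched for the one-dimensional case. This is a minimal Python illustration with hypothetical helper names (`decompose_ndrange`, `global_id`). It assumes the global size is a multiple of the work-group size and ignores global offsets. The index formula mirrors what CUDA programmers write as `blockIdx.x * blockDim.x + threadIdx.x`.

```python
def decompose_ndrange(global_size, local_size):
    """Map every global work-item id to (group_id, local_id) in a 1D NDRange."""
    if global_size % local_size != 0:
        raise ValueError("global size must be a multiple of the work-group size")
    return [(gid // local_size, gid % local_size) for gid in range(global_size)]


def global_id(group_id, local_id, local_size):
    """Rebuild the global id from work-group id and local id."""
    return group_id * local_size + local_id


# 256 work-items split into work-groups of 64 work-items each.
LOCAL_SIZE = 64
mapping = decompose_ndrange(global_size=256, local_size=LOCAL_SIZE)
num_groups = len({group for group, _ in mapping})
```

Each work-item uses its global id to pick its share of the data, while the local id addresses scratch-pad memory shared within the work-group.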
===Thread Examples===

Nvidia GeForce GTX 580 (Fermi, CC2) <ref>[https://en.wikipedia.org/wiki/CUDA#Technical_Specification CUDA Technical_Specification on Wikipedia]</ref>

* Warp size: 32
* Maximum number of threads per block: 1024
* Maximum number of resident blocks per multiprocessor: 8
* Maximum number of resident warps per multiprocessor: 48
* Maximum number of resident threads per multiprocessor: 1536

AMD Radeon HD 7970 (GCN) <ref>[ AMD GPU Hardware Basics]</ref>

* Wavefront size: 64
* Maximum number of work-items per work-group: 1024
* Maximum number of work-groups per compute unit: 40
* Maximum number of Wavefronts per compute unit: 40
* Maximum number of work-items per compute unit: 2560
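The per-compute-unit limits translate into device-wide thread counts by simple multiplication. The sketch below uses the Radeon HD 7970 figures from this page (64 work-items per wavefront, 40 wavefronts per compute unit, 32 compute units). The function names are hypothetical, and the result is an upper bound that real kernels may not reach.

```python
def resident_threads_per_cu(waves_per_cu, wave_size):
    """Upper bound of resident threads on one compute unit."""
    return waves_per_cu * wave_size


def resident_threads_per_device(waves_per_cu, wave_size, compute_units):
    """Upper bound of resident threads on the whole device."""
    return resident_threads_per_cu(waves_per_cu, wave_size) * compute_units


def waves_per_work_group(work_group_size, wave_size):
    """Number of wavefronts needed to hold one work-group (rounded up)."""
    return -(-work_group_size // wave_size)


hd7970_per_cu = resident_threads_per_cu(waves_per_cu=40, wave_size=64)
hd7970_total = resident_threads_per_device(waves_per_cu=40, wave_size=64,
                                           compute_units=32)
hd7970_waves_in_largest_group = waves_per_work_group(1024, 64)
```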
=Memory Model=

OpenCL offers the following memory model for the programmer:

* __private - usually registers, accessible only by a single work-item or thread.
* __local - scratch-pad memory shared across the work-items of a work-group or the threads of a block.
* __constant - read-only memory.
* __global - usually VRAM, accessible by all work-items or threads.

{| class="wikitable" style="margin:auto"
|+ Terminology
|-
! OpenCL Terminology !! CUDA Terminology
|-
| Private Memory || Registers
|-
| Local Memory || Shared Memory
|-
| Constant Memory || Constant Memory
|-
| Global Memory || Global Memory
|}

The Nvidia [ GeForce GTX 580], for example, is able to run a total of 24576 threads, spread over 16 compute units with a total of 512 cores <ref>CUDA C Programming Guide v7.0, Appendix G. COMPUTE CAPABILITIES, Table 12 Technical Specifications per Compute Capability</ref>, and the AMD [ Radeon HD 7970] a total of 81920 threads, spread over 32 compute units with a total of 2048 cores <ref>AMD Accelerated Parallel Processing OpenCL Programming Guide rev2.7, Appendix D Device Parameters, Table D.1 Parameters for 7xxx Devices</ref>. In practice, the register and shared memory sizes limit the total number of threads.
===Memory Examples===

The memory hierarchy of a GPU consists mainly of private memory (registers accessed by a single thread or work-item), local memory (shared by the threads of a block or the work-items of a work-group), constant memory, different types of cache and global memory. Size, latency and bandwidth vary between vendors and architectures.

Here is the data for the Nvidia GeForce GTX 580 ([ Fermi]) as an example: <ref>CUDA C Programming Guide v7.0, Appendix G.COMPUTE CAPABILITIES</ref>
* 128 KiB private memory per compute unit
* 48 KiB (16 KiB) local memory per compute unit (configurable)
* 8 KiB constant cache per compute unit
* 16 KiB (48 KiB) L1 cache per compute unit (configurable)
* 768 KiB L2 cache in total
* 1.5 GiB to 3 GiB global memory
Here the data for the AMD Radeon HD 7970 ([ GCN]) as an example: <ref>AMD Accelerated Parallel Processing OpenCL Programming Guide rev2.7, Appendix D Device Parameters, Table D.1 Parameters for 7xxx Devices</ref>
* 256 KiB private memory per compute unit
* 64 KiB local memory per compute unit
* 16 KiB constant cache per four compute units
* 16 KiB L1 cache per compute unit
* 768 KiB L2 cache in total
* 3 GiB to 6 GiB global memory
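The private memory sizes above bound how many threads can be resident on a compute unit once a kernel's register demand is known. The following is a rough Python sketch under stated assumptions: 32-bit registers, no allocation granularity, and a hypothetical kernel register count. Real hardware allocates registers in architecture-specific granules, so actual numbers can be lower.

```python
def register_limited_threads(private_bytes, regs_per_thread, hw_thread_limit,
                             bytes_per_reg=4):
    """Estimate resident threads per compute unit limited by register usage.

    private_bytes   -- private memory (register file) per compute unit
    regs_per_thread -- registers one thread of the kernel needs (assumed)
    hw_thread_limit -- architectural maximum of resident threads per unit
    """
    total_regs = private_bytes // bytes_per_reg
    return min(hw_thread_limit, total_regs // regs_per_thread)


# Radeon HD 7970: 256 KiB private memory and at most 2560 work-items per CU.
heavy_kernel = register_limited_threads(256 * 1024, regs_per_thread=64,
                                        hw_thread_limit=2560)
light_kernel = register_limited_threads(256 * 1024, regs_per_thread=16,
                                        hw_thread_limit=2560)
```

With the hypothetical 64 registers per work-item the register file is the limit, while the lighter kernel is capped by the architectural maximum.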
===Unified Memory===
Usually data has to be copied between a CPU host and a discrete GPU device, but different architectures from different vendors with different frameworks on different operating systems may offer a unified and accessible address space between CPU and GPU.
=Instruction Throughput=
GPUs are used in [ HPC] environments because of their good [ FLOP]/Watt ratio. The instruction throughput in general depends on the architecture (like Nvidia's [ Tesla], [ Fermi], [ Kepler], [ Maxwell] or AMD's [ TeraScale], [ GCN], [ RDNA]), the brand (like Nvidia [ GeForce], [ Quadro], [ Tesla] or AMD [ Radeon], [ Radeon Pro], [ Radeon Instinct]) and the specific model.

==Integer Instruction Throughput==

* INT32: Depending on architecture and operation, the 32-bit integer performance can be lower than the 32-bit FLOP or 24-bit integer performance.
* INT64: In general, [ registers] and Vector-[ ALUs] of consumer brand GPUs are 32-bit wide and have to emulate 64-bit integer operations.<ref>[ AMD Vega White Paper]</ref> <ref>[ Nvidia Turing White Paper]</ref>
* INT8: Some architectures offer higher throughput with lower precision. They quadruple the INT8 or octuple the INT4 throughput.

==Floating-Point Instruction Throughput==

* FP32: Consumer GPU performance is usually measured in single-precision (32-bit) floating-point FMA (fused-multiply-add) throughput.
* FP64: Consumer GPUs have in general a lower ratio (FP32:FP64) for double-precision (64-bit) floating-point operation throughput than server brand GPUs.
* FP16: Some GPGPU architectures offer half-precision (16-bit) floating-point operation throughput with an FP32:FP16 ratio of 1:2. For example, AMD [ Vega] doubles the [ FP16] and quadruples the [ INT8] throughput <ref>[ Vega (GCN 5th generation) from Wikipedia]</ref>, and Nvidia [ Turing] doubles the FP16 throughput of its [ FPUs] <ref>[ AnandTech - Nvidia Turing Deep Dive page 4]</ref>.
==Throughput Examples==
Nvidia GeForce GTX 580 (Fermi, CC 2.0) - 32-bit integer operations/clock cycle per compute unit <ref>CUDA C Programming Guide v7.0, Chapter 5.4.1. Arithmetic Instructions</ref>
 MAD 16
 Bitwise XOR 32
Max theoretic ADD operation throughput: 32 Ops x 16 CUs x 1544 MHz = 790.528 GigaOps/sec
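The peak figure follows from operations per clock per unit, times the number of units, times the clock rate. A small Python sketch of that arithmetic is given below. The function name `peak_gigaops` is hypothetical, and the inputs are the values quoted in these throughput examples (compute units for the GTX 580, processing elements for the HD 7970).

```python
def peak_gigaops(ops_per_clock_per_unit, units, clock_mhz):
    """Theoretical peak throughput in GigaOps/sec.

    ops_per_clock_per_unit -- operations per clock cycle and unit
    units                  -- number of compute units or processing elements
    clock_mhz              -- core clock in MHz
    """
    return ops_per_clock_per_unit * units * clock_mhz / 1000.0


gtx580_add = peak_gigaops(32, 16, 1544)    # ADD per compute unit
gtx580_mad = peak_gigaops(16, 16, 1544)    # MAD per compute unit
hd7970_add = peak_gigaops(1, 2048, 925)    # ADD per processing element
hd7970_mad = peak_gigaops(0.25, 2048, 925) # MAD per processing element
```

These are theoretical upper bounds. Memory latency, divergence and occupancy decide how much of them a real kernel reaches.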
AMD Radeon HD 7970 (GCN 1.0) - 32-bit integer operations/clock cycle per processing element <ref>AMD_OpenCL_Programming_Optimization_Guide.pdf 3.0beta, Chapter 2.7.1 Instruction Bandwidths</ref>
 MAD 1/4
 Bitwise XOR 1
Max theoretic ADD operation throughput: 1 Op x 2048 PEs x 925 MHz = 1894.4 GigaOps/sec

=Tensors=
MMAC (matrix-multiply-accumulate) units are used in consumer brand GPUs for neural network based upsampling of video game resolutions, in professional brands for upsampling of images and videos, and in server brand GPUs for accelerating convolutional neural networks in general. Convolutions can be implemented as a series of matrix multiplications via Winograd transformations <ref>[ Re: To TPU or not to TPU...] by [[Rémi Coulom]], [[CCC]], December 16, 2017</ref>. Mobile SoCs usually have a dedicated neural network engine as MMAC unit.

==Nvidia TensorCores==
: TensorCores were introduced with the Nvidia [ Volta] series. They offer FP16xFP16+FP32 matrix-multiply-accumulate units, used to accelerate neural networks.<ref>[ INSIDE VOLTA]</ref> Turing's 2nd gen TensorCores add FP16, INT8, INT4 optimized computation.<ref>[ AnandTech - Nvidia Turing Deep Dive page 6]</ref> Ampere's 3rd gen adds support for BF16, TF32, FP64 and sparsity acceleration.<ref>[ Wikipedia - Ampere microarchitecture]</ref> Ada Lovelace's 4th gen adds support for FP8.<ref>[ Wikipedia - Ada Lovelace microarchitecture]</ref>

==AMD Matrix Cores==
: In 2020 AMD released its server-class [ CDNA] architecture with Matrix Cores, which support MFMA (matrix-fused-multiply-add) operations on various data types like INT8, FP16, BF16, FP32. AMD's CDNA 2 architecture adds FP64 optimized throughput for matrix operations. AMD's RDNA 3 architecture features dedicated AI tensor operation acceleration. AMD's CDNA 3 architecture adds support for FP8 and sparse matrix data (sparsity).

==Intel XMX Cores==
: Intel added XMX (Xe Matrix eXtensions) cores to some of the [ Intel Xe] GPU series, like [ Arc Alchemist] and [ Intel Data Center GPU Max Series].
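The FP16xFP16+FP32 operation that such matrix units provide can be written out as a plain reference computation. The sketch below is a CPU illustration in Python, not a model of any vendor's hardware. It rounds the inputs to half precision, accumulates in single precision, and leaves aside the tile sizes, accumulation order and rounding details of real units. The helper names are hypothetical.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def mma(a, b, c):
    """Reference D = A x B + C with FP16 inputs and FP32 accumulation."""
    rows, inner, cols = len(a), len(b), len(b[0])
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = to_fp32(c[i][j])
            for p in range(inner):
                # Inputs are rounded to FP16, the running sum stays in FP32.
                acc = to_fp32(acc + to_fp16(a[i][p]) * to_fp16(b[p][j]))
            d[i][j] = acc
    return d


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
b = [[float(4 * i + j) for j in range(4)] for i in range(4)]
c = [[1.0] * 4 for _ in range(4)]
d = mma(identity, b, c)  # equals b + 1 element-wise for these small integers
```

Values such as 0.1 are not exactly representable in FP16, so inputs lose precision before the multiplication while the wider accumulator limits further error growth.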
=Host-Device Latencies=
One reason GPUs are not used as accelerators for chess engines is the host-device latency, also known as kernel-launch overhead. Nvidia and AMD have not published official numbers, but in practice there is a measurable latency for null-kernels of 5 microseconds <ref>[ host-device latencies?] by [[Srdja Matovic]], Nvidia CUDA ZONE, Feb 28, 2019</ref> up to hundreds of microseconds <ref>[ host-device latencies?] by [[Srdja Matovic]] AMD Developer Community, Feb 28, 2019</ref>. One solution to overcome this limitation is to group tasks into batches to be executed in one run <ref>[ Re: GPU ANN, how to deal with host-device latencies?] by [[Milos Stanisavljevic]], [[CCC]], May 06, 2018</ref>.
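The effect of batching on the launch overhead can be estimated with a simple cost model. The Python sketch below assumes that one kernel launch serves a whole batch and that compute time grows linearly with the batch size. The 1 microsecond of compute per position is a hypothetical figure, while the 5 microsecond launch latency is the lower bound quoted above.

```python
def time_per_position_us(launch_latency_us, compute_us_per_position, batch_size):
    """Average cost per position when one launch serves a whole batch."""
    if batch_size < 1:
        raise ValueError("batch size must be at least 1")
    return launch_latency_us / batch_size + compute_us_per_position


unbatched = time_per_position_us(5.0, 1.0, batch_size=1)
batched = time_per_position_us(5.0, 1.0, batch_size=100)
slow_driver_batched = time_per_position_us(100.0, 1.0, batch_size=100)
```

The launch latency is amortized over the batch, at the price of a search that has to collect enough positions before it can call the GPU.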
=Deep Learning=
GPUs are much better suited than CPUs to implement and train [[Neural Networks#Convolutional|Convolutional Neural Networks]] (CNN), and were therefore also responsible for the [[Deep Learning|deep learning]] boom. This also affected game playing programs combining CNN with [[Monte-Carlo Tree Search|MCTS]], as pioneered by [[Google]] [[DeepMind|DeepMind's]] [[AlphaGo]] and [[AlphaZero]] entities in [[Go]], [[Shogi]] and [[Chess]] using [ TPUs], and the open source projects [[Leela Zero]], headed by [[Gian-Carlo Pascutto]], for [[Go]] and its [[Leela Chess Zero]] adaptation.
= Architectures =
The market is split into two categories, integrated and discrete GPUs. The first is the most important by quantity, the second by performance. Discrete GPUs are divided into consumer brands for playing 3D games, professional brands for CAD/CGI programs and server brands for big-data and number-crunching workloads. Each brand offers different feature sets in drivers, VRAM, or computation abilities.
== AMD ==
AMD's line of discrete GPUs is branded as Radeon for consumer, Radeon Pro for professional and Radeon Instinct for server.
* [ List of AMD graphics processing units on Wikipedia]
=== CDNA3 ===
The CDNA3 HPC architecture was unveiled in December 2023, with the MI300A APU model (CPU+GPU+HBM) and the MI300X GPU model, both with a multi-chip module design. It features Matrix Cores with support for a broad range of precisions, such as INT8, FP8, BF16, FP16, TF32, FP32 and FP64, as well as sparse matrix data (sparsity). It is supported by AMD's ROCm open software stack for AMD Instinct accelerators.
* [ AMD CDNA3 Whitepaper]
* [ AMD Instinct MI300/CDNA3 Instruction Set Architecture]
* [ AMD ROCm Developer Hub]
=== Navi 3x RDNA3 ===
RDNA3 architecture in Radeon RX 7000 series was announced on November 3, 2022, featuring dedicated AI tensor operation acceleration.
GPUs were originally intended to process matrix multiplications for graphical transformations and rendering* [https://en.wikipedia. org/wiki/Radeon_RX_7000_series AMD Radeon RX 7000 on Wikipedia]* [[Neural Networks#Convolutional|Convolutional Neural Networks]] can have their operations interpreted as a series of matrix multiplicationshttps://developer.amd. GPUs are therefore a natural fit to parallelize and process CNNscom/wp-content/resources/RDNA3_Shader_ISA_December2022.pdf RDNA3 Instruction Set Architecture]
GPUs traditionally operated on 32=== CDNA2 === CDNA2 architecture in MI200 HPC-bit floating point numbers. However, CNNs can make due GPU with 16-bit half floats optimized FP64 throughput (FP16matrix and vector), or even 8-bit or 4-bit numbers. One thousand single-precision floats will take up 4kB of space, while onemulti-thousand FP16 will take up 2kB of space. A halfchip-float uses half the memory, eats only half the memory bandwidth, module design and only half the space Infinity Fabric was unveiled in caches. As suchNovember, GPUs such as AMD Vega or NVidia Volta added support for FP16 processing2021.
* [https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf AMD CDNA2 Whitepaper]
* [ CDNA2 Instruction Set Architecture]

=== CDNA ===
The CDNA architecture in the MI100 HPC GPU with Matrix Cores was unveiled in November 2020.

* [ AMD CDNA Whitepaper]
* [ CDNA Instruction Set Architecture]

=== Navi 2x RDNA2 ===
[ RDNA2] cards were unveiled on October 28, 2020.

* [ AMD Radeon RX 6000 on Wikipedia]
* [ RDNA 2 Instruction Set Architecture]

=== Navi RDNA ===
[ RDNA] cards were unveiled on July 7, 2019.

* [ RDNA Whitepaper]
* [ Architecture Slide Deck]
* [ RDNA Instruction Set Architecture]

=== Vega GCN 5th gen ===
[ Vega] cards were unveiled on August 14, 2017.

* [ Architecture Whitepaper]
* [ Vega Instruction Set Architecture]

=== Polaris GCN 4th gen ===
[ Polaris] cards were first released in 2016.

* [ Architecture Whitepaper]
* [ GCN3/4 Instruction Set Architecture]

=== Southern Islands GCN 1st gen ===
Southern Islands cards introduced the [ GCN] architecture in 2012.

* [ AMD Radeon HD 7000 on Wikipedia]
* [ Southern Islands Programming Guide]
* [ Southern Islands Instruction Set Architecture]

== Apple ==

=== M series ===
Apple released its M series SoC (system on a chip) with integrated GPU for desktops and notebooks in 2020.

* [ Apple M series on Wikipedia]

== ARM ==
The ARM Mali GPU variants can be found on various systems on chips (SoCs) from different vendors. Since Midgard (2012), with its unified shader model, OpenCL support is offered.

* [ Mali variants on Wikipedia]

=== Valhall (2019) ===
* [ Bifrost and Valhall OpenCL Developer Guide]

=== Bifrost (2016) ===
* [ Bifrost and Valhall OpenCL Developer Guide]

=== Midgard (2012) ===
* [ Midgard OpenCL Developer Guide]

== Intel ==

=== Xe ===
The [ Intel Xe] line of GPUs (released since 2020) is divided into Xe-LP (low-power), Xe-HPG (high-performance-gaming), Xe-HP (high-performance) and Xe-HPC (high-performance-computing).

* [ List of Intel Gen12 GPUs on Wikipedia]
* [ Arc Alchemist series on Wikipedia]

==Nvidia==
Nvidia's line of discrete GPUs is branded as GeForce for consumers, Quadro for professionals and Tesla for servers.

* [ List of Nvidia graphics processing units on Wikipedia]

=== Grace Hopper Superchip ===
The Nvidia GH200 Grace Hopper Superchip was unveiled in August 2023 and combines the Nvidia Grace CPU ([[ARM|ARM v9]]) and Nvidia Hopper GPU architectures via NVLink to deliver a CPU+GPU coherent memory model for accelerated AI and HPC applications.

* [ NVIDIA Grace Hopper Superchip Data Sheet]
* [ NVIDIA Grace Hopper Superchip Architecture Whitepaper]

=== Ada Lovelace Architecture ===
The [ Ada Lovelace microarchitecture] was announced on September 20, 2022, featuring 4th-generation Tensor Cores with FP8, FP16, BF16, TF32 and sparsity acceleration.

* [ Ada GPU Whitepaper]
* [ Ada Tuning Guide]

=== Hopper Architecture ===
The [ Hopper GPU Datacenter microarchitecture] was announced on March 22, 2022, featuring Transformer Engines for large language models.

* [ Hopper H100 Whitepaper]
* [ Hopper Tuning Guide]

=== Ampere Architecture ===
The [ Ampere microarchitecture] was announced on May 14, 2020 <ref>[ NVIDIA Ampere Architecture In-Depth | NVIDIA Developer Blog] by [ Ronny Krashinsky], [ Olivier Giroux], [ Stephen Jones], [ Nick Stam] and [ Sridhar Ramaswamy], May 14, 2020</ref>. The Nvidia A100 GPU based on the Ampere architecture delivers a generational leap in accelerated computing in conjunction with CUDA 11 <ref>[ CUDA 11 Features Revealed | NVIDIA Developer Blog] by [ Pramod Ramarao], May 14, 2020</ref>.

* [ Ampere GA100 Whitepaper]
* [ Ampere GA102 Whitepaper]
* [ Ampere GPU Architecture Tuning Guide]

=== Turing Architecture ===
[ Turing] cards were first released in 2018. They are the first consumer cards to launch with RTX features for [ raytracing]. They are also the first consumer cards to launch with TensorCores, used for matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]]. The Turing GTX line of chips does not offer RTX or TensorCores.

* [ Turing Architecture Whitepaper]
* [ Turing Tuning Guide]

=== Volta Architecture ===
[ Volta] cards were released in 2017. They were the first cards to launch with TensorCores, supporting matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]].

* [ Volta Architecture Whitepaper]
* [ Volta Tuning Guide]

=== Pascal Architecture ===
[ Pascal] cards were first released in 2016.

* [ Pascal Architecture Whitepaper]
* [ Pascal Tuning Guide]

=== Maxwell Architecture ===
[ Maxwell] cards were first released in 2014.

* [ Maxwell Architecture Whitepaper]
* [ Maxwell Tuning Guide]

== PowerVR ==
PowerVR (Imagination Technologies) licenses IP to third parties (most notably Apple) used for system on a chip (SoC) designs. Since Series5 SGX, OpenCL support via licensees is available.

=== PowerVR ===
* [ PowerVR series on Wikipedia]

=== IMG ===
* [ IMG A series on Wikipedia]
* [ IMG B series on Wikipedia]

== Qualcomm ==
Qualcomm offers Adreno GPUs in various types as a component of their Snapdragon SoCs. Since the Adreno 300 series, OpenCL support is offered.

=== Adreno ===
* [ Adreno variants on Wikipedia]

== Vivante Corporation ==
Vivante licenses IP to third parties for embedded systems; the GC series offers optional OpenCL support.

=== GC-Series ===
* [ GC series on Wikipedia]

Specialized units, such as Nvidia Volta's "Tensor cores", can perform an entire 4x4 block of FP16 matrix multiplications in just one PTX assembly language statement. It is with these instructions that CNN operations are accelerated.

GPUs are much more suited than CPUs to implement and train [[Neural Networks#Convolutional|Convolutional Neural Networks]] (CNN) and were therefore also responsible for the [[Deep Learning|deep learning]] boom, also affecting game playing programs combining CNN with [[Monte-Carlo Tree Search|MCTS]], as pioneered by [[Google]] [[DeepMind|DeepMind's]] [[AlphaGo]] and [[AlphaZero]] entities in [[Go]], [[Shogi]] and [[Chess]] using [ TPUs], and the open source projects [[Leela Zero]] headed by [[Gian-Carlo Pascutto]] for [[Go]] and its [[Leela Chess Zero]] adaption.
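The tensor-core operation mentioned in this section, a 4x4 block of FP16 matrix multiplication issued as one instruction, can be modelled in plain arithmetic. The following is only a minimal sketch of the math, not of the hardware or of any PTX or CUDA interface: the helper names <code>to_fp16</code> and <code>mma_4x4</code> are hypothetical, and the assumption that inputs are rounded to FP16 while products are summed in a wider accumulator is an illustrative choice.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    # The struct format code "e" is a 16-bit binary float.
    return struct.unpack("<e", struct.pack("<e", x))[0]


def mma_4x4(a, b, c):
    """Return D = A*B + C for 4x4 matrices given as lists of rows.

    Hypothetical model: A and B are rounded to FP16 first, products are
    accumulated in Python floats, standing in for a wider accumulator.
    """
    a16 = [[to_fp16(v) for v in row] for row in a]
    b16 = [[to_fp16(v) for v in row] for row in b]
    return [
        [c[i][j] + sum(a16[i][k] * b16[k][j] for k in range(4)) for j in range(4)]
        for i in range(4)
    ]


if __name__ == "__main__":
    identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    b = [[float(4 * i + j) for j in range(4)] for i in range(4)]
    c = [[1.0] * 4 for _ in range(4)]
    for row in mma_4x4(identity, b, c):
        print(row)
```

On a GPU the sixteen output elements are produced together by one hardware instruction; the nested loops here only spell out the same arithmetic element by element.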
=See also=
* [[Deep Learning]]
** [[AlphaGo]]
** [[AlphaZero]]
** [[Neural Networks#Convolutional|Convolutional Neural Networks]]
** [[Leela Zero]]
** [[Leela Chess Zero]]
* [[FPGA]]
* [[Graphics Programming]]
* [[SIMD and SWAR Techniques]]
* [[Thread]]
* [[Zeta]]
==1986==
* [[Mathematician#Hillis|W. Daniel Hillis]], [[Mathematician#GSteele|Guy L. Steele, Jr.]] ('''1986'''). ''[ Data parallel algorithms]''. [[ACM#Communications|Communications of the ACM]], Vol. 29, No. 12, Special Issue on Parallelism
==1990==
* [[Mathematician#GEBlelloch|Guy E. Blelloch]] ('''1990'''). ''[ Vector Models for Data-Parallel Computing]''. [ MIT Press], [ pdf]
==2008 ...==
* [[Vlad Stamate]] ('''2008'''). ''Real Time Photon Mapping Approximation on the GPU''. In [ ShaderX6 - Advanced Rendering Techniques] <ref>[ Photon mapping from Wikipedia]</ref>
* [[Ren Wu]], [ Bin Zhang], [ Meichun Hsu] ('''2009'''). ''[ Clustering billions of data points using GPUs]''. [ ACM International Conference on Computing Frontiers]
* [ Mark Govett], [ Craig Tierney], [[Jacques Middlecoff]], [ Tom Henderson] ('''2009'''). ''Using Graphical Processing Units (GPUs) for Next Generation Weather and Climate Prediction Models''. [ CAS2K9 Workshop]
* [[Hank Dietz]], [ Bobby Dalton Young] ('''2009'''). ''[ MIMD Interpretation on a GPU]''. [ LCPC 2009], [ pdf], [ slides.pdf]
* [ Sander van der Maar], [[Joost Batenburg]], [ Jan Sijbers] ('''2009'''). ''[ Experiences with Cell-BE and GPU for Tomography]''. [ SAMOS 2009] <ref>[ Cell (microprocessor) from Wikipedia]</ref>
* [ Avi Bleiweiss] ('''2010'''). ''Playing Zero-Sum Games on the GPU''. [ NVIDIA Corporation], [ GPU Technology Conference 2010], [ slides as pdf]
* [[Damian Sulewski]] ('''2011'''). ''Large-Scale Parallel State Space Search Utilizing Graphics Processing Units and Solid State Disks''. Ph.D. thesis, [[University of Dortmund]], [ pdf]
* [[Damjan Strnad]], [[Nikola Guid]] ('''2011'''). ''[ Parallel Alpha-Beta Algorithm on the GPU]''. [ CIT. Journal of Computing and Information Technology], Vol. 19, No. 4 » [[Parallel Search]], [[Othello|Reversi]]
* [[Balázs Jako|Balázs Jákó]] ('''2011'''). ''Fast Hydraulic and Thermal Erosion on GPU''. M.Sc. thesis, Supervisor [ Balázs Tóth], [ Eurographics 2011], [ pdf]
* [[Liang Li]], [[Hong Liu]], [[Peiyu Liu]], [[Taoying Liu]], [[Wei Li]], [[Hao Wang]] ('''2012'''). ''[ A Node-based Parallel Game Tree Algorithm Using GPUs]''. CLUSTER 2012 » [[Parallel Search]]
* [ Ali Karami], [[S. Ali Mirsoleimani]], [ Farshad Khunjush] ('''2013'''). ''[ A statistical performance prediction model for OpenCL kernels on NVIDIA GPUs]''. [ CADS 2013]
* [[Diego Rodríguez-Losada]], [[Pablo San Segundo]], [[Miguel Hernando]], [ Paloma de la Puente], [ Alberto Valero-Gomez] ('''2013'''). ''GPU-Mapping: Robotic Map Building with Graphical Multiprocessors''. [ IEEE Robotics & Automation Magazine, Vol. 20, No. 2], [ pdf]
* [ David Williams], [[Valeriu Codreanu]], [ Po Yang], [ Baoquan Liu], [ Feng Dong], [ Burhan Yasar], [ Babak Mahdian], [ Alessandro Chiarini], [ Xia Zhao], [ Jos Roerdink] ('''2013'''). ''[ Evaluation of Autoparallelization Toolkits for Commodity GPUs]''. [ PPAM 2013]
* [ Qingqing Dang], [ Shengen Yan], [[Ren Wu]] ('''2014'''). ''[ A fast integral image generation algorithm on GPUs]''. [ ICPADS 2014]
* [[S. Ali Mirsoleimani]], [ Ali Karami], [ Farshad Khunjush] ('''2014'''). ''[ A Two-Tier Design Space Exploration Algorithm to Construct a GPU Performance Predictor]''. [ ARCS 2014], [ Lecture Notes in Computer Science], Vol. 8350, [ Springer]
* [[Steinar H. Gunderson]] ('''2014'''). ''[ Movit: High-speed, high-quality video filters on the GPU]''. [ FOSDEM] [ 2014], [ pdf]
* [ Baoquan Liu], [ Alexandru Telea], [ Jos Roerdink], [ Gordon Clapworthy], [ David Williams], [ Po Yang], [ Feng Dong], [[Valeriu Codreanu]], [ Alessandro Chiarini] ('''2014'''). ''Parallel centerline extraction on the GPU''. [ Computers & Graphics], Vol. 41, [ pdf]
==2015 ...==
* [[Peter H. Jin]], [[Kurt Keutzer]] ('''2015'''). ''Convolutional Monte Carlo Rollouts in Go''. [ arXiv:1512.03375] » [[Deep Learning]], [[Go]], [[Monte-Carlo Tree Search|MCTS]]
* [[Liang Li]], [[Hong Liu]], [[Hao Wang]], [[Taoying Liu]], [[Wei Li]] ('''2015'''). ''[ A Parallel Algorithm for Game Tree Search Using GPGPU]''. [[IEEE#TPDS|IEEE Transactions on Parallel and Distributed Systems]], Vol. 26, No. 8 » [[Parallel Search]]
* [[Simon Portegies Zwart]], [ Jeroen Bédorf] ('''2015'''). ''[ Using GPUs to Enable Simulation with Computational Gravitational Dynamics in Astrophysics]''. [[IEEE#Computer|IEEE Computer]], Vol. 48, No. 11
* <span id="Astro"></span>[ Sean Sheen] ('''2016'''). ''[ Astro - A Low-Cost, Low-Power Cluster for CPU-GPU Hybrid Computing using the Jetson TK1]''. Master's thesis, [ California Polytechnic State University], [ pdf] <ref>[ Jetson TK1 Embedded Development Kit | NVIDIA]</ref> <ref>[ Jetson GPU architecture] by [[Dann Corbit]], [[CCC]], October 18, 2016</ref>
* [ Jingyue Wu], [ Artem Belevich], [ Eli Bendersky], [ Mark Heffernan], [ Chris Leary], [ Jacques Pienaar], [ Bjarke Roune], [ Rob Springer], [ Xuetian Weng], [ Robert Hundt] ('''2016'''). ''[ gpucc: an open-source GPGPU compiler]''. [ CGO 2016]
* [[David Silver]], [[Shih-Chieh Huang|Aja Huang]], [[Chris J. Maddison]], [[Arthur Guez]], [[Laurent Sifre]], [[George van den Driessche]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Veda Panneershelvam]], [[Marc Lanctot]], [[Sander Dieleman]], [[Dominik Grewe]], [[John Nham]], [[Nal Kalchbrenner]], [[Ilya Sutskever]], [[Timothy Lillicrap]], [[Madeleine Leach]], [[Koray Kavukcuoglu]], [[Thore Graepel]], [[Demis Hassabis]] ('''2016'''). ''[ Mastering the game of Go with deep neural networks and tree search]''. [ Nature], Vol. 529 » [[AlphaGo]]
* [[Balázs Jako|Balázs Jákó]] ('''2016'''). ''[ Hardware accelerated hybrid rendering on PowerVR GPUs]''. <ref>[ PowerVR from Wikipedia]</ref> [[IEEE]] [ 20th Jubilee International Conference on Intelligent Engineering Systems]
* [[Diogo R. Ferreira]], [ Rui M. Santos] ('''2016'''). ''[ Parallelization of Transition Counting for Process Mining on Multi-core CPUs and GPUs]''. [ BPM 2016]
* [ Ole Schütt], [ Peter Messmer], [ Jürg Hutter], [[Joost VandeVondele]] ('''2016'''). ''[ GPU Accelerated Sparse Matrix–Matrix Multiplication for Linear Scaling Density Functional Theory]''. [ pdf] <ref>[ Density functional theory from Wikipedia]</ref>
: Chapter 8 in [ Ross C. Walker], [ Andreas W. Götz] ('''2016'''). ''[ Electronic Structure Calculations on Graphics Processing Units: From Quantum Chemistry to Condensed Matter Physics]''. [ John Wiley & Sons]
* [[David Silver]], [[Thomas Hubert]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Matthew Lai]], [[Arthur Guez]], [[Marc Lanctot]], [[Laurent Sifre]], [[Dharshan Kumaran]], [[Thore Graepel]], [[Timothy Lillicrap]], [[Karen Simonyan]], [[Demis Hassabis]] ('''2017'''). ''Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm''. [ arXiv:1712.01815] » [[AlphaZero]]
* [[Tristan Cazenave]] ('''2017'''). ''[ Residual Networks for Computer Go]''. [[IEEE#TOCIAIGAMES|IEEE Transactions on Computational Intelligence and AI in Games]], Vol. PP, No. 99, [ pdf]
* [ Jayvant Anantpur], [ Nagendra Gulur Dwarakanath], [ Shivaram Kalyanakrishnan], [[Shalabh Bhatnagar]], [ R. Govindarajan] ('''2017'''). ''RLWS: A Reinforcement Learning based GPU Warp Scheduler''. [ arXiv:1712.04303]
* [[David Silver]], [[Thomas Hubert]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Matthew Lai]], [[Arthur Guez]], [[Marc Lanctot]], [[Laurent Sifre]], [[Dharshan Kumaran]], [[Thore Graepel]], [[Timothy Lillicrap]], [[Karen Simonyan]], [[Demis Hassabis]] ('''2018'''). ''[ A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play]''. [ Science], Vol. 362, No. 6419
* [ Announcing lczero] by [[Gary Linscott|Gary]], [[CCC]], January 09, 2018 » [[Leela Chess Zero]]
* [ GPU ANN, how to deal with host-device latencies?] by [[Srdja Matovic]], [[CCC]], May 06, 2018 » [[Neural Networks]]
* [ GPU contention] by [[Ian Kennedy]], [[CCC]], May 07, 2018 » [[Leela Chess Zero]]
* [ How good is the RTX 2080 Ti for Leela?] by Hai, September 15, 2018 » [[Leela Chess Zero]] <ref>[ GeForce 20 series from Wikipedia]</ref>
: [ Re: How good is the RTX 2080 Ti for Leela?] by [[Ankan Banerjee]], [[CCC]], September 16, 2018
* [ Generate EGTB with graphics cards?] by [[Pham Hong Nguyen|Nguyen Pham]], [[CCC]], January 01, 2019 » [[Endgame Tablebases]]
* [ LCZero FAQ is missing one important fact] by [[Jouni Uski]], [[CCC]], January 01, 2019 » [[Leela Chess Zero]]
* [ Michael Larabel benches lc0 on various GPUs] by [[Warren D. Smith]], [[Computer Chess Forums|LCZero Forum]], January 14, 2019 » [[Leela Chess Zero#Lc0|Lc0]] <ref>[ Phoronix Test Suite from Wikipedia]</ref>
* [ Using LC0 with one or two GPUs - a guide] by [[Srdja Matovic]], [[CCC]], March 30, 2019 » [[Leela Chess Zero#Lc0|Lc0]]
* [ Wouldn't it be nice if C++ GPU] by [[Chris Whittington]], [[CCC]], April 25, 2019 » [[Cpp|C++]]
* [ Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search] by Percival Tiglao, [[CCC]], June 06, 2019
* [ My home-made CUDA kernel for convolutions] by [[Rémi Coulom]], [[Computer Chess Forums|Game-AI Forum]], November 09, 2019 » [[Deep Learning]]
* [ GPU rumors 2020] by [[Srdja Matovic]], [[CCC]], November 13, 2019
==2020 ...==
* [ AB search with NN on GPU...] by [[Srdja Matovic]], [[CCC]], August 13, 2020 » [[Neural Networks]] <ref>[ kernel launch latency - CUDA / CUDA Programming and Performance - NVIDIA Developer Forums] by LukeCuda, June 18, 2018</ref>
* [ I stumbled upon this article on the new Nvidia RTX GPUs] by [[Kai Laskos]], [[CCC]], September 10, 2020
* [ Will AMD RDNA2 based Radeon RX 6000 series kick butt with Lc0?] by [[Srdja Matovic]], [[CCC]], November 01, 2020
* [ Zeta with NNUE on GPU?] by [[Srdja Matovic]], [[CCC]], March 31, 2021 » [[Zeta]], [[NNUE]]
* [ GPU rumors 2021] by [[Srdja Matovic]], [[CCC]], April 16, 2021
* [ Comparison of all known Sliding lookup algorithms <nowiki>[CUDA]</nowiki>] by [[Daniel Infuehr]], [[CCC]], January 08, 2022 » [[Sliding Piece Attacks]]
* [ Re: China boosts in silicon...] by [[Srdja Matovic]], [[CCC]], January 13, 2024
=External Links=
* [ General-purpose computing on graphics processing units (GPGPU) from Wikipedia]
* [ List of AMD graphics processing units from Wikipedia]
* [ List of Intel graphics processing units from Wikipedia]
* [ List of Nvidia graphics processing units from Wikipedia]
* [ NVIDIA Developer]
