Build and run Docker containers leveraging NVIDIA GPUs
Updated Sep 16, 2020 - Makefile
A flexible framework of neural networks for deep learning
Parallel and Heterogeneous Task Programming in Modern C++
Go package for computer vision using OpenCV 4 and beyond.
The current implementation of join can be improved by performing the operation in a single call to the backend kernel instead of multiple calls.
This is a fairly easy kernel and may be a good issue for someone getting to know CUDA/ArrayFire internals. Ping me if you want additional info.
PR NVIDIA/cub#218 fixes this in CUB's radix sort. We should:
HIP: C++ Heterogeneous-Compute Interface for Portability
We do not have documentation specifying the different treelite Operator values that FIL supports. (https://github.com/dmlc/treelite/blob/46c8390aed4491ea97a017d447f921efef9f03ef/include/treelite/base.h#L40)
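Until that documentation exists, a minimal sketch may help illustrate what the Operator values mean. The names below (kEQ, kLT, kLE, kGT, kGE) are assumptions based on treelite's Operator enum in base.h at that era; treat them as illustrative, not as the authoritative list of values FIL supports.

```python
# Hedged sketch: evaluating a decision-node split test for each
# assumed treelite Operator value. The enum names below are an
# assumption mirroring treelite's base.h, not a verified spec.
import operator

TREELITE_OPS = {
    "kEQ": operator.eq,  # feature value == threshold
    "kLT": operator.lt,  # feature value <  threshold
    "kLE": operator.le,  # feature value <= threshold
    "kGT": operator.gt,  # feature value >  threshold
    "kGE": operator.ge,  # feature value >= threshold
}

def split_goes_left(feature_value, threshold, op_name):
    """Return True if the node's comparison test passes
    (conventionally: traverse to the left child)."""
    return TREELITE_OPS[op_name](feature_value, threshold)

print(split_goes_left(0.5, 1.0, "kLT"))  # True: 0.5 < 1.0
```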
Report needed documentation
https://github.com/rapidsai/cuml/blob/branch-0.15/cpp/test/sg/fil_test.cu
There are multiple places in the fil_test.cu file
I often use -v just to see that something is going on, but a progress bar (enabled by default) would serve the same purpose and be more concise.
We can just factor out the code from futhark bench for this.
PyGraphistry is a Python library to quickly load, shape, embed, and explore big graphs with the GPU-accelerated Graphistry visual graph analyzer
A community run, 5-day PyTorch Deep Learning Bootcamp
ThunderSVM: A Fast SVM Library on GPUs and CPUs
CUDA integration for Python, plus shiny features
an implementation of 3D Ken Burns Effect from a Single Image using PyTorch
CUDA Templates for Linear Algebra Subroutines
Minkowski Engine is an auto-diff neural network library for high-dimensional sparse tensors
Samples for CUDA developers which demonstrate features in the CUDA Toolkit
GraphVite: A General and High-performance Graph Embedding System
Segmented reduce uses the same template type OffsetIteratorT for begin and end offsets
static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceSegmentedReduce::Sum
( void * d_temp_storage,
size_t & temp_storage_bytes,
InputIteratorT d_in,
OutputIteratorT d_out,
int num_segments,
OffsetIteratorT d_begin_offsets,
OffsetIteratorT d_end_offsets, ... )
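The offset convention behind that signature can be modeled on the CPU: segment i spans [d_begin_offsets[i], d_end_offsets[i]), and a single contiguous offsets array can serve as both iterators when shifted by one. A minimal Python sketch of those semantics (not the CUB API itself):

```python
# CPU model of DeviceSegmentedReduce::Sum's offset convention:
# segment i covers d_in[begin_offsets[i]:end_offsets[i]].
# Passing offsets[:-1] and offsets[1:] as begin/end yields
# contiguous, possibly empty segments.
def segmented_sum(d_in, begin_offsets, end_offsets):
    return [sum(d_in[b:e]) for b, e in zip(begin_offsets, end_offsets)]

data = [1, 2, 3, 4, 5, 6]
offsets = [0, 2, 2, 6]  # three segments; segment 1 is empty
out = segmented_sum(data, offsets[:-1], offsets[1:])
print(out)  # [3, 0, 18]
```

Using the same type for both offset parameters is what makes the offsets / offsets-shifted-by-one pattern convenient; allowing distinct iterator types would also permit, e.g., transform iterators for one side only.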
The current default value for the rows_per_chunk parameter of the CSV writer is 8, which means that the input table is by default broken into many small slices that are written out sequentially. This reduces performance by an order of magnitude in some cases. In the Python layer, the default is the number of rows (i.e., the table is written out in a single pass). We can follow this by setting rows_per_chunk
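The chunking behavior described above can be sketched with plain Python: a small rows_per_chunk means many sequential write calls, each paying per-call overhead, while rows_per_chunk equal to the row count writes once. The write_csv_chunked helper below is a hypothetical stand-in, not the cudf writer API.

```python
# Sketch of chunked CSV writing (hypothetical stand-in for the
# real writer): the output is identical either way, but a small
# rows_per_chunk multiplies the number of sequential write calls.
import io

def write_csv_chunked(rows, rows_per_chunk):
    buf = io.StringIO()
    calls = 0
    for start in range(0, len(rows), rows_per_chunk):
        chunk = rows[start:start + rows_per_chunk]
        buf.write("\n".join(",".join(map(str, r)) for r in chunk) + "\n")
        calls += 1
    return buf.getvalue(), calls

rows = [(i, i * i) for i in range(100)]
_, many = write_csv_chunked(rows, 8)          # default-like: 13 calls
_, one = write_csv_chunked(rows, len(rows))   # single pass: 1 call
print(many, one)  # 13 1
```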