How to Optimize Tensor Cores for 2–4x Faster Deep Learning Training

If you are looking to speed up your deep learning training, Tensor Cores can make your models train faster with just a few simple changes to your code. In the world of AI, training time is a critical bottleneck; reducing it from weeks to days, or days to hours, can significantly accelerate your development cycle. This guide will show you how to optimize Tensor Cores to get better performance for deep learning.

Whether you are training on a local workstation or utilizing high-performance infrastructure like Perlod hosting, these hardware optimizations are essential for getting the most value out of your compute resources.

What Are Tensor Cores and How Do They Work?

Tensor Cores are special hardware units inside NVIDIA GPUs that are built to do matrix math very fast. Deep learning models spend most of their time doing matrix operations, so Tensor Cores can speed up training by a huge amount.

Regular CUDA cores multiply numbers one at a time, but Tensor Cores multiply entire blocks of numbers in a single step. This means Tensor Cores can do the same work 8-16 times faster than regular GPU cores for deep learning tasks.

Tensor Cores perform a special operation called fused multiply-add (FMA). They take two small matrices, multiply them together, and add the result to a third matrix, all in one clock cycle.

For example, a Tensor Core can calculate:

D = A × B + C

A, B, C, and D are 4×4 matrices, and this happens in a single step instead of many separate operations.
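
For instance, the following PyTorch snippet runs the same D = A × B + C pattern at a larger scale. This is a minimal sketch (the 1024×1024 sizes are arbitrary); on a Tensor Core GPU, this FP16 addmm call is tiled by cuBLAS into the hardware-level fused multiply-adds described above.

import torch

# D = A x B + C with FP16 inputs; eligible shapes and dtypes are
# dispatched to Tensor Core kernels by cuBLAS
A = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
B = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
C = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)

D = torch.addmm(C, A, B)  # fused multiply-add: C + A @ B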

Which GPUs Have Tensor Cores?

Not all NVIDIA GPUs have Tensor Cores. Before you begin optimization, it is essential to verify your hardware capabilities.
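
A quick way to check from Python is to read the GPU's compute capability; Tensor Cores first appeared with the Volta architecture (compute capability 7.0). Here is a minimal sketch using PyTorch:

import torch

# Tensor Cores were introduced with Volta (compute capability 7.0)
major, minor = torch.cuda.get_device_capability(0)
print(torch.cuda.get_device_name(0))
print("Tensor Cores available:", (major, minor) >= (7, 0))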

Many newer consumer GPUs have Tensor Cores, but top-level performance usually needs powerful data center GPUs. Instead of buying this expensive hardware, many companies rent GPU dedicated servers so they can use these high-end cards without paying the full cost up front.

Below is a quick overview of which data center and consumer GPUs include Tensor Cores.

Data Center GPUs include:

  • Tesla V100 (Volta, 1st-generation Tensor Cores)
  • A100 (Ampere, 3rd-generation Tensor Cores with TF32 support)
  • H100 and H200 (Hopper, 4th-generation Tensor Cores with FP8 support)
  • B200 (Blackwell, 5th-generation Tensor Cores)

The H100 is currently the most popular GPU for serious AI training, with 4th-generation Tensor Cores and FP8 support. If you are considering H100 for your workloads, check out our complete guide on NVIDIA H100 Hosting to understand specifications, costs, and how to get access.

Consumer GPUs include:

Any GeForce RTX GPU has Tensor Cores, starting with the RTX 20 series; older GTX cards do not.

Optimize Tensor Cores for Deep Learning Performance

In this section, we show you how to use NVIDIA’s Tensor Cores to make your models train up to 4x faster. We’ll cover the simplest and most effective strategies to speed up your workflow.

1. Enable Mixed Precision Training

Most deep learning models have been trained with 32-bit (FP32) numbers, which are very precise. But in practice, models usually don’t need that much precision to learn well.

If we use lower precision, we can do calculations faster and store more data in GPU memory at the same time.

Mixed precision will give you 2-3x faster training, 50% less memory usage, and the same model accuracy.

The recommended method is PyTorch Mixed Precision. Here is the correct way to enable mixed precision in PyTorch:

import torch
from torch.cuda.amp import autocast, GradScaler

# Create your model and optimizer
model = YourModel().cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()

# Create the GradScaler for FP16
scaler = GradScaler()

# Training loop
for epoch in range(num_epochs):
    for inputs, labels in dataloader:
        inputs, labels = inputs.cuda(), labels.cuda()
        
        # Clear gradients
        optimizer.zero_grad()
        
        # Forward pass with automatic mixed precision
        with autocast():
            outputs = model(inputs)
            loss = criterion(outputs, labels)
        
        # Backward pass with gradient scaling
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

In this method:

  • autocast() automatically uses FP16 where safe and FP32 where needed.
  • GradScaler prevents small gradients from becoming zero in FP16.

For TensorFlow Mixed Precision, you can use:

import tensorflow as tf
from tensorflow.keras import mixed_precision

# Enable mixed precision globally
mixed_precision.set_global_policy('mixed_float16')

# Check that it worked
print('Compute dtype:', mixed_precision.global_policy().compute_dtype)
print('Variable dtype:', mixed_precision.global_policy().variable_dtype)

# Build and train your model normally; keep the final layer's outputs in
# float32 for numeric stability, as recommended for mixed precision
model = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax', dtype='float32')
])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.fit(x_train, y_train, epochs=10)

After setting this policy, all your layers automatically use mixed precision.

2. Choose the Right Number Format for GPU

Not all 16-bit types work the same way, and your GPU’s design decides which one is the most stable and fastest. Picking the right 16-bit format helps you avoid training problems.

Different number formats work better for different tasks. Here is what each format is good for:

For most users, BF16 is the best choice if your GPU supports it.

  • It has the same number range as FP32.
  • You don’t need GradScaler.
  • Training is more stable.
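
If your GPU supports BF16 (Ampere and newer), you can pass it to autocast directly. This is a minimal sketch that reuses the model, optimizer, and criterion objects from the earlier training loop; no GradScaler is needed because BF16 keeps the full FP32 exponent range:

import torch
from torch.cuda.amp import autocast

# BF16 autocast: same dynamic range as FP32, so gradient scaling is unnecessary
optimizer.zero_grad()
with autocast(dtype=torch.bfloat16):
    outputs = model(inputs)
    loss = criterion(outputs, labels)

loss.backward()
optimizer.step()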

For older GPUs like V100 and RTX 20 series, FP16 works, but it needs GradScaler to prevent gradient underflow.

For A100 and newer GPUs, TF32 gives you Tensor Core speed for FP32 code without any model changes. In PyTorch, TF32 is enabled by default for cuDNN convolutions, but since PyTorch 1.12 you need to enable it explicitly for matrix multiplications.

To check or enable it, you can use:

import torch

# Check the current settings (matmul TF32 is off by default since PyTorch 1.12)
print(torch.backends.cuda.matmul.allow_tf32)
print(torch.backends.cudnn.allow_tf32)  # True by default

# Enable TF32 for matrix multiplications and cuDNN convolutions
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

For H100, H200, and Blackwell, you can consider using FP8, which gives the highest speed on the newest GPUs. Use NVIDIA’s Transformer Engine for easy FP8 training.
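
Here is a minimal sketch of FP8 training with Transformer Engine, assuming the transformer_engine package is installed. The layer sizes and recipe settings below are illustrative, so check NVIDIA's Transformer Engine documentation for the exact API in your version:

import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# FP8 recipe with delayed scaling (these settings are illustrative)
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

# Use Transformer Engine layers in place of the matching torch.nn layers
layer = te.Linear(1024, 1024).cuda()
inp = torch.randn(64, 1024, device="cuda")

# Run the forward pass under the FP8 autocast context
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = layer(inp)

loss = out.float().sum()
loss.backward()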

3. Dimension Alignment for Best Performance

Tensor Cores work best when your matrix sizes follow certain rules. The main rule is to make dimensions divisible by 8 for FP16, or divisible by 16 for INT8.

This rule applies to:

  • Batch size: Number of samples per training step.
  • Hidden layer sizes: Number of neurons in fully connected layers.
  • Channel counts: Input and output channels in conv layers.
  • Sequence lengths: For transformer models.

Examples of Good and Bad Dimensions Include:

  • Good: batch size 64, hidden size 1024, channel count 256, vocabulary size 33,712 (all divisible by 8).
  • Bad: batch size 30, hidden size 1013, channel count 130, vocabulary size 33,708 (none divisible by 8).

When dimensions are not aligned, two things happen:

  1. The GPU may fall back to slower CUDA cores.
  2. Tensor Cores waste compute on padding.

A simple change like padding vocabulary size from 33,708 to 33,712 can give you noticeable speed improvements.
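
A small helper like the one below (the function name is ours, purely illustrative) rounds any dimension up to the nearest required multiple:

def pad_to_multiple(size, multiple=8):
    """Round a dimension up to the nearest multiple, e.g. 33708 -> 33712."""
    return ((size + multiple - 1) // multiple) * multiple

print(pad_to_multiple(33708))      # 33712
print(pad_to_multiple(33708, 64))  # 33728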

For maximum performance on A100 and newer:

  • Use multiples of 64 for FP16/BF16.
  • Use multiples of 128 for INT8.

4. Batch Size Optimization

Another thing that directly affects how well Tensor Cores work is batch size. The key rules of batch sizing include:

  • Start with powers of 2: Batch sizes like 32, 64, 128, or 256 work well.
  • At minimum, use multiples of 8: If you can’t use powers of 2, at least use multiples of 8.
  • Bigger batches usually train faster: Larger batch sizes make better use of the GPU’s parallel power.

If your GPU doesn’t have enough memory for large batches, gradient accumulation lets you simulate bigger batches. To use gradient accumulation, you can use:

import torch
from torch.cuda.amp import autocast, GradScaler

model = YourModel().cuda()
optimizer = torch.optim.Adam(model.parameters())
criterion = torch.nn.CrossEntropyLoss()
scaler = GradScaler()

accumulation_steps = 4  # Effective batch = batch_size × 4

for i, (inputs, labels) in enumerate(train_loader):
    inputs, labels = inputs.cuda(), labels.cuda()
    
    with autocast():
        outputs = model(inputs)
        loss = criterion(outputs, labels) / accumulation_steps
    
    scaler.scale(loss).backward()
    
    # Update weights every accumulation_steps batches
    if (i + 1) % accumulation_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()

This gives you an effective batch size 4 times larger than what fits in memory.

5. Switch To Channels Last Memory Format

For image models (CNNs), switching to channels last memory format speeds up training. PyTorch stores image data as NCHW by default:

  • N = Batch size
  • C = Channels
  • H = Height
  • W = Width

Tensor Cores prefer NHWC format. Converting to the channels last memory format will speed up your training.

To enable Channels Last, you can use:

import torch

# Convert model to channels last
model = model.to(memory_format=torch.channels_last)

# Convert input to channels last
input = input.to(memory_format=torch.channels_last)

# Training works normally
output = model(input)

Here is a complete training example with Channels Last:

import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler

# Create model and convert to channels last
model = YourCNNModel().cuda()
model = model.to(memory_format=torch.channels_last)

optimizer = torch.optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()
scaler = GradScaler()

for epoch in range(num_epochs):
    for images, labels in dataloader:
        # Convert images to channels last
        images = images.cuda().to(memory_format=torch.channels_last)
        labels = labels.cuda()
        
        optimizer.zero_grad()
        
        with autocast():
            outputs = model(images)
            loss = criterion(outputs, labels)
        
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

6. Enable cuDNN Benchmark Mode

cuDNN is NVIDIA’s library of optimized deep learning functions. Benchmark mode makes cuDNN automatically find the fastest algorithms for your model.

To enable cuDNN benchmark mode, you can run:

import torch

# Enable cuDNN benchmark mode
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.enabled = True

It is recommended to use cuDNN benchmark mode when your input sizes stay the same during training and you run many iterations with the same model.

Do not use cuDNN benchmark mode when input sizes change between batches or when you only run a few iterations.

The first run will be slower because cuDNN tests different algorithms. After that, all runs use the fastest option found.

7. Use torch.compile for Extra Speed

PyTorch 2.0 introduced torch.compile, which can give you an additional 2-10x speedup on top of the other optimizations, depending on the model.

The basic usage includes:

import torch

# Basic compilation
model = YourModel().cuda()
compiled_model = torch.compile(model)

# Use compiled model normally
output = compiled_model(input)

Optimization modes include:

# Default mode - good balance of compile time and speedup
compiled_model = torch.compile(model)

# Reduce overhead mode - faster for small batches
compiled_model = torch.compile(model, mode="reduce-overhead")

# Max autotune mode - slowest compile, fastest runtime
compiled_model = torch.compile(model, mode="max-autotune")

torch.compile speeds things up by:

  • Tracing your model’s operations.
  • Fusing multiple operations into a single GPU kernel.
  • Reducing Python overhead.

The first run takes longer because the model compiles. After that, every run is faster.

How to Check Tensor Cores Usage?

After implementing these changes, it is important to verify that the optimizations are working as intended.

To verify that Tensor Cores are actually being used, you can use NVIDIA’s profiling tools.

1. Use Nsight Compute. Profile your program and check Tensor Core usage with:

ncu --metrics sm__inst_executed_pipe_tensor_op_hmma.sum ./your_program

If the count is zero, Tensor Cores are not being used.

2. Use nvidia-smi. Monitor GPU usage in real-time with:

nvidia-smi dmon -s u
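
If you prefer to stay inside Python, the built-in PyTorch profiler can also list the CUDA kernels your model launches. This is a minimal sketch that assumes a model whose forward pass accepts a (64, 1024) FP16 tensor; as a rough heuristic, Tensor Core GEMM kernels usually have names containing strings like "hmma" or "tensor":

import torch
from torch.profiler import profile, ProfilerActivity

model = YourModel().cuda().half()
inputs = torch.randn(64, 1024, device="cuda", dtype=torch.float16)

# Profile a single forward pass and inspect the kernel names
with profile(activities=[ProfilerActivity.CUDA]) as prof:
    model(inputs)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))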

Common Reasons that Tensor Cores Don’t Activate include:

  • Dimensions not divisible by 8
  • Using FP32 without TF32 enabled
  • Mixed precision not enabled
  • Operations that don’t support Tensor Cores
  • cuDNN not enabled

Troubleshooting Common Issues in Tensor Cores

Optimizing for hardware can sometimes lead to unexpected behavior. Here are the most common issues you may face when optimizing for Tensor Cores, along with their solutions:

1. Training is not faster with mixed precision:

  • Check that your GPU has Tensor Cores.
  • Make sure dimensions are divisible by 8.
  • Verify autocast is wrapping your forward pass.

2. Loss becomes NaN or Inf:

  • Use GradScaler with FP16.
  • Switch to BF16 if your GPU supports it.
  • Check for exploding gradients in your model.

3. Model accuracy drops with mixed precision: Some operations need FP32. Use autocast exclusions:

with autocast():
    # Most operations use FP16
    x = model.layer1(x)
    
    # Force FP32 for sensitive operations
    with autocast(enabled=False):
        x = x.float()
        x = sensitive_operation(x)

4. Out of memory errors:

  • Reduce batch size.
  • Use gradient checkpointing (see the sketch after this list).
  • Enable mixed precision.
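
Gradient checkpointing saves memory by recomputing activations during the backward pass instead of storing them all. Here is a minimal sketch using PyTorch's built-in utility; the loop over model.blocks is illustrative and assumes your model exposes its layers as a list of blocks:

import torch
from torch.utils.checkpoint import checkpoint

def forward_with_checkpointing(model, x):
    # Recompute each block's activations during backward to save memory
    for block in model.blocks:  # assumes the model exposes a list of blocks
        x = checkpoint(block, x, use_reentrant=False)
    return x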

Quick Tensor Cores Optimization Checklist

You can use this checklist to make sure you are getting the most from Tensor Cores:

  • Enable mixed precision with autocast and GradScaler (PyTorch) or set_global_policy (TensorFlow).
  • Make dimensions divisible by 8 for batch size, hidden sizes, and channel counts.
  • Use batch sizes that are multiples of 8 or powers of 2.
  • Convert to channels last format for CNN models.
  • Enable cuDNN benchmark mode for consistent input sizes.
  • Use torch.compile in PyTorch 2.0+.
  • Verify with profiling tools that Tensor Cores are being used.

FAQs

Do I need to change my model to use Tensor Cores?

No. Just enable mixed precision training and make sure your dimensions are divisible by 8.

Can Tensor Cores be used for inference?

Yes, Tensor Cores speed up both training and inference.

What is TF32, and do I need to enable it?

TF32 is a format that gives Tensor Core speed with FP32 code. It’s enabled by default on A100 and newer GPUs in PyTorch. You don’t need to change any code to use it.

Conclusion

Tensor Cores are powerful hardware that can dramatically speed up your deep learning training. By applying these optimizations, you can expect 2-4x faster training with the same model accuracy.

Once you have optimized your training, the next challenge is deploying your models at scale. If you are building an AI application that needs to serve many users, read our guide on How to Build a Scalable GPU Backend for AI SaaS to learn about vLLM serving, Kubernetes autoscaling, and cost-saving techniques like semantic caching.

We hope you enjoyed this guide. Subscribe to our X and Facebook channels to get the latest updates and articles on GPU and AI Hosting.
