Horovod has several build and runtime options that must be set correctly to get the best distributed deep learning performance. This guide walks through the system requirements, installation process, and configuration needed for a smooth setup.
Horovod System Requirements
Operating System Requirements
- Linux distributions (Ubuntu 18.04+ recommended)
- macOS (limited functionality)
- Windows (not officially supported)
Hardware Prerequisites
- Processor: Modern multi-core CPU
- Memory: At least 8GB RAM (16GB+ recommended)
- Storage: 20GB+ free space
- GPU: CUDA-capable NVIDIA GPU (recommended but optional)
Software Dependencies
- Python 3.6 or newer
- C++ compiler (g++-5 or above)
- CMake 3.13+
- CUDA Toolkit (required only for GPU support)
- NCCL 2 (for optimal GPU performance)
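The requirements above can be checked before attempting a build. The following is a minimal sketch, assuming the compiler and build tool are available on your PATH as `g++` and `cmake`; the helper names (`parse_version`, `meets_minimum`, `tool_version`, `check_toolchain`) are illustrative and not part of Horovod.

```python
import re
import shutil
import subprocess
import sys


def parse_version(text):
    """Return the first dotted version number found in text as a tuple of ints."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(found, minimum):
    """True when a parsed version tuple is at least the required minimum."""
    return found is not None and found >= minimum


def tool_version(executable):
    """Run `<executable> --version` and parse the result; None if unavailable."""
    path = shutil.which(executable)
    if path is None:
        return None
    try:
        result = subprocess.run(
            [path, "--version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=10,
        )
    except (OSError, subprocess.SubprocessError):
        return None
    return parse_version(result.stdout)


def check_toolchain():
    """Compare local tools with the minimums listed in this guide."""
    return {
        "python>=3.6": meets_minimum(tuple(sys.version_info[:3]), (3, 6)),
        "g++>=5": meets_minimum(tool_version("g++"), (5,)),
        "cmake>=3.13": meets_minimum(tool_version("cmake"), (3, 13)),
    }


if __name__ == "__main__":
    for requirement, ok in check_toolchain().items():
        print(f"{requirement}: {'ok' if ok else 'missing or too old'}")
```

The check only reports what it finds; it does not look for the CUDA Toolkit or NCCL, whose install locations vary between systems.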
Preparing Your Environment
Python Environment Setup
Make sure your Python environment is correctly configured before installing Horovod.
Create a new virtual environment:
- Use conda or virtualenv (recommended)
- Isolates dependencies for easier management
- Avoids conflicts with other projects
Install required frameworks:
- TensorFlow (1.15.0 or newer)
- PyTorch (1.5.0 or newer)
- MXNet (1.4.1 or newer)
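To confirm that an installed framework meets these minimums, the versions can be read from package metadata without importing the frameworks themselves. This is a sketch under stated assumptions: it needs Python 3.8+ for `importlib.metadata`, and the distribution names `tensorflow`, `torch`, and `mxnet` are assumed (some builds are published under other names). The helpers `release_tuple` and `framework_status` are illustrative.

```python
from importlib import metadata

# Minimum framework versions as listed in this guide. The pip distribution
# names are assumptions; variant builds may use different names.
MINIMUMS = {
    "tensorflow": (1, 15, 0),
    "torch": (1, 5, 0),
    "mxnet": (1, 4, 1),
}


def release_tuple(version):
    """Convert '2.1.0+cu118' or '1.15.0rc1' to a 3-part (or longer) int tuple."""
    parts = []
    for piece in version.split("+")[0].split("."):
        digits = ""
        for char in piece:
            if not char.isdigit():
                break
            digits += char
        if not digits:
            break
        parts.append(int(digits))
    # Pad short versions such as '1.15' so tuple comparison behaves as expected.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def framework_status(minimums=MINIMUMS):
    """Map each framework to 'not installed', 'too old (x)', or 'ok (x)'."""
    status = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            status[name] = "not installed"
            continue
        ok = release_tuple(installed) >= minimum
        status[name] = f"{'ok' if ok else 'too old'} ({installed})"
    return status


if __name__ == "__main__":
    for name, state in framework_status().items():
        print(f"{name}: {state}")
```

Only one framework needs to be present; "not installed" for the others is expected in most environments.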
System Package Installation
Install these system prerequisites before installing Horovod:
- Development tools
- MPI implementation
- CUDA drivers (for GPU configurations)
- Network libraries
Horovod Installation Methods
Basic Installation
The simplest and most common way to install Horovod is with pip. The basic installation:
- Installs the base Horovod package
- Includes essential features
- Is designed for simple distributed training
Framework-Specific Installation
Select installation options based on your deep learning framework:
TensorFlow Support:
- Guarantees TensorFlow compatibility
- Enables distributed TensorFlow training
- Includes TensorFlow-specific optimizations
PyTorch Support:
- Provides PyTorch distributed support
- Contains PyTorch-specific features
- Optimizes GPU communication
MXNet Support:
- Adds MXNet compatibility
- Supports distributed training with MXNet
- Includes necessary adapters
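The basic and framework-specific installs differ only in the extras requested from pip. The sketch below builds the command rather than running it, assuming the Horovod package on PyPI accepts the extras `tensorflow`, `pytorch`, and `mxnet`; `pip_install_command` is an illustrative helper, not a Horovod API.

```python
import sys

# Extras assumed to be accepted by the Horovod package on PyPI.
KNOWN_EXTRAS = ("tensorflow", "pytorch", "mxnet")


def pip_install_command(frameworks=()):
    """Build a pip command for Horovod, optionally with framework extras."""
    unknown = [name for name in frameworks if name not in KNOWN_EXTRAS]
    if unknown:
        raise ValueError(f"unsupported extras: {unknown}")
    requirement = "horovod"
    if frameworks:
        requirement += "[" + ",".join(frameworks) + "]"
    # Using the current interpreter keeps the install inside the active
    # virtual environment.
    return [sys.executable, "-m", "pip", "install", requirement]


if __name__ == "__main__":
    print(" ".join(pip_install_command()))             # basic installation
    print(" ".join(pip_install_command(["pytorch"])))  # with PyTorch support
```

To perform the install, the returned list can be passed to `subprocess.run(command, check=True)`.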
Advanced Installation Options
For specialized needs and optimizations:
GPU Support:
- NCCL integration
- GPU-aware communication
- Enhanced performance features
CPU Optimization:
- Intel’s oneCCL support
- CPU-level performance features
- Advanced threading options
Configuration and Optimization
Environment Variables
Key environment variables to set when building Horovod:
Framework Selection:
- HOROVOD_WITH_TENSORFLOW
- HOROVOD_WITH_PYTORCH
- HOROVOD_WITH_MXNET
GPU Configuration:
- HOROVOD_GPU_OPERATIONS
- HOROVOD_CUDA_HOME
- HOROVOD_NCCL_HOME
Build Options:
- HOROVOD_BUILD_ARCH_FLAGS
- HOROVOD_CMAKE
- HOROVOD_CPU_OPERATIONS
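These variables are read when pip builds Horovod, so they must be present in the environment of the install command. A minimal sketch, assuming that setting a `HOROVOD_WITH_*` variable to `1` requires that framework and that `NCCL` is an accepted value for `HOROVOD_GPU_OPERATIONS`; the helper `horovod_build_env` and any paths passed to it are illustrative.

```python
import os
import sys


def horovod_build_env(frameworks=("pytorch",), gpu_operations=None,
                      cuda_home=None, nccl_home=None, base_env=None):
    """Return a copy of the environment with Horovod build variables added."""
    env = dict(os.environ if base_env is None else base_env)
    for name in frameworks:
        # For example, "pytorch" becomes HOROVOD_WITH_PYTORCH=1.
        env["HOROVOD_WITH_" + name.upper()] = "1"
    if gpu_operations is not None:
        env["HOROVOD_GPU_OPERATIONS"] = gpu_operations
    if cuda_home is not None:
        env["HOROVOD_CUDA_HOME"] = cuda_home
    if nccl_home is not None:
        env["HOROVOD_NCCL_HOME"] = nccl_home
    return env


if __name__ == "__main__":
    command = [sys.executable, "-m", "pip", "install", "--no-cache-dir", "horovod"]
    env = horovod_build_env(frameworks=("pytorch",), gpu_operations="NCCL")
    added = {k: v for k, v in env.items() if k.startswith("HOROVOD_")}
    print("would run:", " ".join(command))
    print("with:", added)
    # To build for real: subprocess.run(command, env=env, check=True)
```

The `--no-cache-dir` flag is included so that pip rebuilds Horovod with the new settings instead of reusing a previously built wheel.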
Performance Tuning
Optimize your Horovod installation:
Communication Backend:
- MPI configuration
- Gloo settings
- NCCL parameters
Memory Management:
- Cache size adjustment
- Buffer allocation
- Memory limits
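Runtime tuning is also commonly done through environment variables passed to each training process. The sketch below assumes the variables `HOROVOD_FUSION_THRESHOLD` (bytes), `HOROVOD_CYCLE_TIME` (milliseconds), and `HOROVOD_CACHE_CAPACITY` behave as described in Horovod's tensor fusion documentation; the helper name and the example values are placeholders, not recommendations.

```python
import os


def horovod_tuning_env(fusion_threshold_mb=None, cycle_time_ms=None,
                       cache_capacity=None, base_env=None):
    """Return a copy of the environment with Horovod runtime tuning variables."""
    env = dict(os.environ if base_env is None else base_env)
    if fusion_threshold_mb is not None:
        # The fusion buffer size is expressed in bytes.
        env["HOROVOD_FUSION_THRESHOLD"] = str(int(fusion_threshold_mb * 1024 * 1024))
    if cycle_time_ms is not None:
        env["HOROVOD_CYCLE_TIME"] = str(cycle_time_ms)
    if cache_capacity is not None:
        env["HOROVOD_CACHE_CAPACITY"] = str(int(cache_capacity))
    return env


if __name__ == "__main__":
    env = horovod_tuning_env(fusion_threshold_mb=32, cycle_time_ms=3.5,
                             cache_capacity=2048, base_env={})
    for key in sorted(env):
        print(f"{key}={env[key]}")
```

Suitable values depend on the model and the network, so measure throughput before and after each change.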
Verification and Testing
Installation Verification
Verify your Horovod installation:
Basic Checks:
- Version verification
- Framework compatibility
- Feature availability
Comprehensive Testing:
- Communication tests
- GPU functionality
- Framework integration
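The basic checks can be scripted around the `horovodrun --check-build` command. This sketch assumes the command prints enabled features as lines beginning with `[X]`; that output format and the helper names (`horovod_build_report`, `enabled_features`) are assumptions. It returns `None` when `horovodrun` is not on the PATH.

```python
import shutil
import subprocess


def horovod_build_report():
    """Return the output of `horovodrun --check-build`, or None if unavailable."""
    launcher = shutil.which("horovodrun")
    if launcher is None:
        return None
    try:
        result = subprocess.run(
            [launcher, "--check-build"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=60,
        )
    except (OSError, subprocess.SubprocessError):
        return None
    return result.stdout


def enabled_features(report):
    """Extract the names marked '[X]' in a check-build style report."""
    features = []
    for line in report.splitlines():
        line = line.strip()
        if line.startswith("[X]"):
            features.append(line[3:].strip())
    return features


if __name__ == "__main__":
    report = horovod_build_report()
    if report is None:
        print("horovodrun not found; is Horovod installed in this environment?")
    else:
        print("Enabled:", ", ".join(enabled_features(report)))
```

A build check does not exercise communication between workers. For that, run a short training script across at least two processes.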
Common Issues and Solutions
Common installation issues and how to resolve them:
Dependency Issues:
- Missing packages
- Version conflicts
- Library incompatibilities
Build Problems:
- Compiler errors
- CUDA issues
- MPI configuration
Cloud Platform Installation
AWS Setup
- AMI selection
- Instance configuration
- Network setup
Google Cloud Platform
- VM configuration
- GPU setup
- Network optimization
Azure Configuration
- VM size selection
- GPU enablement
- Network settings
Maintenance and Updates
Regular Updates
- Version management
- Security patches
- Feature additions
Backup and Recovery
- Configuration backup
- Environment snapshots
- Recovery procedures
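An environment snapshot can be as simple as a list of installed packages with pinned versions, taken before each upgrade. The following is a minimal sketch, assuming Python 3.8+ for `importlib.metadata`; the function names and the output file name are illustrative.

```python
from importlib import metadata


def snapshot_requirements():
    """Return sorted 'name==version' lines for every installed distribution."""
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:
            lines.add(f"{name}=={dist.version}")
    return sorted(lines, key=str.lower)


def write_snapshot(path):
    """Write the snapshot to a file and return the number of packages recorded."""
    lines = snapshot_requirements()
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")
    return len(lines)


if __name__ == "__main__":
    count = write_snapshot("horovod-env-snapshot.txt")
    print(f"recorded {count} packages")
```

The snapshot records Python packages only. System libraries such as MPI, CUDA, and NCCL need to be noted separately.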
Tips and Recommendations
Production Environment
- Security considerations
- Performance optimization
- Monitoring setup
Development Environment
- Debug configuration
- Testing environment
- Continuous integration
Conclusion
Before installing Horovod for the first time, make sure you understand the system requirements, dependencies, and installation options. This guide helps you set up the environment correctly so that you can train distributed deep learning models with less friction. Update and maintain your installation periodically to take advantage of new features and improvements in distributed training.
Whether you are configuring Horovod for research, development, or production, this guide provides a base for a reliable, optimized setup and a point of reference as you continue working with distributed deep learning.