I had to make this entry because there was no way, in any shape or form, that I would be able to merge my PR, because of a lack of resources. Not time, but resources: in this case I didn't have the right hardware to do it.
The Issue
As mentioned before, the issue was for a function in PyTorch to accept **kwargs in its arguments. As an issue it sounded straightforward, but as I delved into it, it became harder and harder. I had to navigate around PyTorch and how it worked, and I had to install countless dependencies, libraries and pieces of software to make it work properly:
- Anaconda
- WSL (Windows Subsystem for Linux)
- CMake
- NCCL
- CUDA
First off, I needed conda to install PyTorch properly along with all of its dependencies, a different setup from the standard PyTorch installation.
I tried using pip, but in order to work with the function I needed a few dependencies that I could only find in Anaconda.
I installed Anaconda thinking I could now use PyTorch, only to realize that to run setup.py I needed to install CMake. But when I installed it, it said the software was incompatible, so I did a little digging and tried another way around: I tried to bypass it and use CUDA, which is a parallel computing platform and programming model developed by NVIDIA.
I read the function's documentation, and it said "Gloo backend would not support this API", so my only choices were MPI and NCCL, which can be found here: Distributed package for torch.
The thing that got me with NCCL, the NVIDIA Collective Communications Library, is that it only works on Linux systems, so I installed WSL. I sat on it for a few days trying to figure out how the distributed package in PyTorch worked as a whole, then went back to the problem and tried to work on it as a separate entity.
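To make that concrete, here is a minimal sketch (my own illustration, not code from the PR) of where the backend choice shows up when you initialize the distributed package:

```python
import torch.distributed as dist

# The backend picked here decides which collectives are available:
# "gloo" runs on CPU and on Windows but, per the docs quoted above,
# does not support all_gather_into_tensor; "nccl" needs NVIDIA GPUs
# on Linux; "mpi" needs PyTorch built against an MPI library.
dist.init_process_group(
    backend="nccl",        # or "gloo" / "mpi"
    init_method="env://",  # reads MASTER_ADDR / MASTER_PORT from the environment
    rank=0,
    world_size=1,
)
```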
The code I had to change was this:

```python
def all_gather_into_tensor(output_tensor, input_tensor, group=None, async_op=False):
```

into this:

```python
def all_gather_into_tensor(output_tensor, input_tensor, group=None, async_op=False, **kwargs):
```
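To illustrate what that extra parameter does, here is a toy Python example (not the actual PyTorch code): any keyword arguments that the signature doesn't list explicitly get collected into the kwargs dict instead of raising a TypeError.

```python
def demo(output_tensor, input_tensor, group=None, async_op=False, **kwargs):
    # Hypothetical stand-in for the patched function: extra keyword
    # arguments are collected here instead of causing a TypeError.
    print(kwargs)

demo(1, 2, tag="demo")   # prints {'tag': 'demo'}
demo(1, 2, group=None)   # prints {} -- known keywords are unaffected
```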
Now I just had to test it out, using a test like this:
```python
import os
import unittest

import torch
import torch.distributed as dist
from torch.distributed import all_gather_into_tensor


class AllGatherIntoTensorTest(unittest.TestCase):
    def setUp(self):
        # Set up a single-process distributed environment for testing
        os.environ['MASTER_ADDR'] = 'localhost'
        os.environ['MASTER_PORT'] = '12355'
        dist.init_process_group("nccl", rank=0, world_size=1)

    def tearDown(self):
        # Clean up the environment
        dist.destroy_process_group()

    def test_all_gather_into_tensor_without_kwargs(self):
        # Testing that the original call pattern still works without kwargs
        # (with the NCCL backend these tensors would need to live on the GPU)
        output_tensor = torch.tensor([1, 2, 3])
        input_tensor = torch.tensor([4, 5, 6])
        group = None
        async_op = False
        result = all_gather_into_tensor(output_tensor, input_tensor, group, async_op)
        self.assertIsNone(result)
        # The tensors themselves should not gain a `kwargs` attribute
        with self.assertRaises(AttributeError):
            _ = output_tensor.kwargs
        with self.assertRaises(AttributeError):
            _ = input_tensor.kwargs
```
This would work on its own, of course, with a little bit of change to the code.
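The flip side, a test that actually passes an extra keyword, could look something like the sketch below. This is a hypothetical companion test for the patched signature, sitting in the same test class, not something that made it into the PR:

```python
    def test_all_gather_into_tensor_with_kwargs(self):
        # With **kwargs in the signature, an unexpected keyword such as
        # `tag` should simply be swallowed instead of raising a TypeError.
        output_tensor = torch.tensor([1, 2, 3])
        input_tensor = torch.tensor([4, 5, 6])
        result = all_gather_into_tensor(
            output_tensor, input_tensor, group=None, async_op=False, tag="demo"
        )
        self.assertIsNone(result)
```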
After that I spent time back in the installation process. After spending literally a week trying to install everything and make it work, I couldn't anymore.
I had to go back to the planning stage and also see whether all_gather_into_tensor_inplace would work with this, but it all boils down to the main functionality of all_gather_into_tensor.
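If that in-place wrapper just forwards to all_gather_into_tensor, then the kwargs would only need to be passed through. A hypothetical sketch of what I mean (I never got to verify this against the real source):

```python
def all_gather_into_tensor_inplace(output_tensor, input_tensor, group=None,
                                   async_op=False, **kwargs):
    # Hypothetical forwarding wrapper: once the underlying function
    # accepts **kwargs, the wrapper only has to pass them along.
    return all_gather_into_tensor(output_tensor, input_tensor,
                                  group=group, async_op=async_op, **kwargs)
```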
I learned what Gloo, MPI and NCCL are, and that NCCL only works on Linux systems; I thought that using WSL would help me out, but to no avail.
It feels like a bit of a letdown, but I learned a lot from this release, even if it wasn't a success and I wasn't able to merge it.
I would happily accept the learning process and the opportunity I received from this.