Fix torch multiprocessing error in gptneox conversion script #587
rohithkrn wants to merge 1 commit into NVIDIA:main from
Conversation
Can you provide some buggy cases that the current script cannot deal with?
@AkiyamaYummy The command I am using is: … It fails at the above line #L141 with an error because I don't have that directory. I see the gptneox_guide suggests cloning the model. To get around the error, I added the check and ran the same command. It then runs into an error similar to #443. Therefore, I introduced the change in this PR to fix these errors.
@byshiue @AkiyamaYummy
By using … Of course, you can also make this script more convenient by making it compatible with this scenario. PS: Some of us, including myself, prefer to maintain Hugging Face model files ourselves, because Hugging Face defaults to downloading models to its cache folder, and I don't want large files taking up hard-disk space without being visible in my workspace.
@AkiyamaYummy That makes sense; my change will not break that behavior.
@byshiue @AkiyamaYummy Reminder for review.
Fixes:
- `Context has already been set` error from torch multiprocessing, similar to [BUG FIX] place multi-processing init to main method #443
- `device_count` is incorrectly set in the example script.
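A minimal sketch of the first fix, using the stdlib `multiprocessing` module (whose start-method API `torch.multiprocessing` mirrors); the function name is hypothetical, not the conversion script's actual code:

```python
import multiprocessing  # torch.multiprocessing mirrors this stdlib API

def init_start_method() -> None:
    # Calling set_start_method() at module import time can crash with
    # "RuntimeError: context has already been set" if a context was
    # created earlier. Moving the call into the main entry point, and
    # passing force=True, avoids the crash.
    multiprocessing.set_start_method("spawn", force=True)

if __name__ == "__main__":
    init_start_method()
```

Without `force=True`, a second `set_start_method("spawn")` call raises the `RuntimeError` this PR works around, which is why the initialization belongs inside the main method rather than at import time.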