Question

Felipe Capalbo asked · Jason Lightfoot commented

Reinforcement Learning CUDA related error

I built a reinforcement learning model, and when I submit the trained model file to a local host connected to the FlexSim Reinforcement Learning tool by running "flexsim_inference.py", I get the following error:

Exception occurred during processing of request from ('127.0.0.1', 53187)
Traceback (most recent call last):
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.11_3.11.2032.0_x64__qbz5n2kfra8p0\Lib\socketserver.py", line 317, in _handle_request_noblock
    self.process_request(request, client_address)
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.11_3.11.2032.0_x64__qbz5n2kfra8p0\Lib\socketserver.py", line 348, in process_request
    self.finish_request(request, client_address)
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.11_3.11.2032.0_x64__qbz5n2kfra8p0\Lib\socketserver.py", line 361, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.11_3.11.2032.0_x64__qbz5n2kfra8p0\Lib\socketserver.py", line 755, in __init__
    self.handle()
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.11_3.11.2032.0_x64__qbz5n2kfra8p0\Lib\http\server.py", line 436, in handle
    self.handle_one_request()
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.11_3.11.2032.0_x64__qbz5n2kfra8p0\Lib\http\server.py", line 424, in handle_one_request
    method()
  File "c:\Users\GAIVOTA_FLEXSIM\Documents\FLEXSIM FELIPE CAPALBO\ESTUDOS ML\Exercicio ML Warehousing\flexsim_reinforcement_learning\flexsim_inference.py", line 11, in do_GET
    self._handle_reply(params)
  File "c:\Users\GAIVOTA_FLEXSIM\Documents\FLEXSIM FELIPE CAPALBO\ESTUDOS ML\Exercicio ML Warehousing\flexsim_reinforcement_learning\flexsim_inference.py", line 30, in _handle_reply
    action, _states = FlexSimInferenceServer.model.predict(observation)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\GAIVOTA_FLEXSIM\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\stable_baselines3\common\base_class.py", line 553, in predict
    return self.policy.predict(observation, state, episode_start, deterministic)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\GAIVOTA_FLEXSIM\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\stable_baselines3\common\policies.py", line 363, in predict
    obs_tensor, vectorized_env = self.obs_to_tensor(observation)
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\GAIVOTA_FLEXSIM\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\stable_baselines3\common\policies.py", line 274, in obs_to_tensor
    obs_tensor = obs_as_tensor(observation, self.device)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\GAIVOTA_FLEXSIM\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\stable_baselines3\common\utils.py", line 483, in obs_as_tensor
    return th.as_tensor(obs, device=device)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

The model represents a warehousing problem: the observation parameter was set to the Type of the item to be allocated, and the action chooses one of the 3 available racks.
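
For context, here is a minimal sketch of the spaces described above, assuming gymnasium-style Discrete spaces (the item-Type count of 5 is a placeholder, not taken from the model):

from gymnasium import spaces

NUM_ITEM_TYPES = 5                                    # placeholder: the real number of item Types
observation_space = spaces.Discrete(NUM_ITEM_TYPES)   # observation: Type of the item to allocate
action_space = spaces.Discrete(3)                     # action: which of the 3 racks to use

# An observation outside the trained range (e.g. an unexpected Type value)
# can trigger a device-side assert once the policy processes it on the GPU
print(observation_space.contains(4))   # True
print(observation_space.contains(7))   # False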

ExercicioWarehousing.fsm

FlexSim 24.0.0
reinforcement learning

Jason Lightfoot ♦ commented:

Hi @Felipe Capalbo, was Nil Ns's answer helpful? If so, please click the "Accept" button at the bottom of their answer. Or if you still have questions, add a comment and we'll continue the conversation.

If we haven't heard back from you within 3 business days we'll auto-accept an answer, but you can always comment back to reopen your question.


1 Answer

Nil Ns answered · edited

Hello Felipe,

It seems that your error is originating from Python rather than FlexSim. The error is related to CUDA, NVIDIA's GPU-computing platform that PyTorch uses to run calculations on the graphics card. Something is going wrong there, but it's not clear exactly what is causing it.

I've done some research, and one way to narrow it down is to run the following test code in Python; it checks whether basic CUDA operations work and reports the exact location of any error:

# Set CUDA_LAUNCH_BLOCKING before importing torch so that CUDA operations
# run synchronously and errors are reported at the exact call that fails
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

import torch

# Try a simple tensor operation on the GPU
x = torch.randn(2, 2).cuda()
y = torch.randn(2, 2).cuda()
z = x + y

# If CUDA is working, this prints the result tensor;
# if not, a RuntimeError is raised pointing at the failing line and file
print(z)
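
As an additional check (a sketch using standard PyTorch calls, not part of flexsim_inference.py), you can also print some basic CUDA diagnostics before loading the model:

import torch

print(torch.__version__)                  # PyTorch build
print(torch.version.cuda)                 # CUDA version PyTorch was built against
print(torch.cuda.is_available())          # True if a usable GPU is visible
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # name of the first GPU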

Felipe Capalbo commented:
tensor([[-2.8758,  0.4874],
        [ 0.7963,  0.2493]], device='cuda:0')
tensor([[-0.7432,  1.4007],
        [-0.3974,  1.5274]], device='cuda:0')

I ran it twice, and these were the results.

Nil Ns commented in reply to Felipe Capalbo:

Hey Felipe,

That code was only meant to test whether basic GPU operations fail. Since it returned results rather than errors, there doesn't seem to be a configuration problem with CUDA itself.


I'm not an expert on this topic, and without all the files it's hard to know exactly what's happening.


I see that the error occurs while processing the observation. Could it be that the RL model was trained with a different observation space (a different number or range of observation values) than the one the model you're connecting now provides? The sketch below shows one way to check.
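
A minimal sketch of that check, assuming the model was trained with Stable-Baselines3 PPO (the algorithm, file name, and sample observation are placeholders):

from stable_baselines3 import PPO   # assumption: PPO was used for training

model = PPO.load("trained_model")   # placeholder file name

# The space the policy was trained with
print(model.observation_space)

# A hypothetical observation, as FlexSim would send it (item Type = 2)
incoming_obs = 2
print(model.observation_space.contains(incoming_obs))
# If this prints False, the trained policy and the current FlexSim model
# disagree about the observation space, which can surface on the GPU as
# a device-side assert inside predict()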


If it's not something like that, I recommend running these two lines before the rest of the code to get a more precise error message:


import os 
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

If no other error appears with that, it could also be that the GPU does not have enough memory to perform the calculation.
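
If you want to check that, here is a quick sketch using standard PyTorch calls (torch.cuda.mem_get_info needs a reasonably recent PyTorch):

import torch

if torch.cuda.is_available():
    free_bytes, total_bytes = torch.cuda.mem_get_info()  # free and total GPU memory in bytes
    print(f"Free GPU memory:  {free_bytes / 1024**2:.0f} MiB")
    print(f"Total GPU memory: {total_bytes / 1024**2:.0f} MiB")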



