Brandon H asked Preet commented

Cloud nodes don't progress beyond "submitted" stage

After following this walk-through, along with some information from this seemingly old procedure, I have ended up in a state where the experiments that I set to run on distributed cloud nodes reach the "Submitted" stage but never progress beyond it. I have left them in that state for up to 10 minutes with no change. Do you have any ideas about what I could do to debug this?

I am using AWS with an EC2 instance that has 2 vCPUs and 4 GB of RAM and is running Windows Server 2022. Because of the limited RAM, I entered "1" in the CPUs field of my local FlexSim setup to stay within the minimum/recommended requirements called out in Distributed experiments or optimizations.

Additionally, I get the following errors in the System Console every few tries. I am unsure whether they are related.

  exception: Experimenter[IP ADDRESS] did not return valid ports
  exception: ExperimenterUnable to create child processes on cloud nodes
  exception: Experimenter Could not get jobID
  exception: FlexScript exception: MAIN:/project/library/Experimenter>behaviour/eventfunctions/assertScenario c: MODEL:/Tools/Experimenter
  exception: FlexScript exception: MAIN:/project/library/Experimenter>behaviour/eventfunctions/assertTask c: MODEL:/Tools/Experimenter
  exception: FlexScript exception: Array index out of bounds at MAIN:/project/library/Experimenter>behaviour/eventfunctions/approveTasks c: MODEL:/Tools/Experimenter
  exception: treewin__CallUserCallback
  ex: CallUserCallback

FlexSim 23.0.2

Jordan Johnson answered

@Brandon H It's not clear to me what's happening. It does seem likely that the connection isn't working correctly. These are the things I have thought of:

  • Be sure you have the latest version of the webserver installed on the EC2 instance.
  • Be sure you have the same version of FlexSim installed on the instance as you are running locally. For example, if you are using 23.0.2 on your local computer, install 23.0.2 on the EC2 instance/image.
  • Be sure that the webserver is running when the instance starts and is listening on port 80.
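
One quick way to check the last point from your local computer is to try opening a TCP connection to port 80 on the instance. This is only a minimal sketch in Python, not part of FlexSim; the function name `can_connect` is my own, and you would substitute your instance's public address for the host.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, `can_connect("<your instance address>", 80)` should return True if the webserver is reachable. If it returns False, the webserver itself and the instance's inbound network rules are the first things I would look at.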

Try testing a basic model:

  • Use a very simple model for testing, perhaps Source-Queue-Processor-Sink.
  • Verify that you can run that experiment normally/locally on your own computer
  • Verify that you can run the model on the EC2 instance. I think you can drag/drop or maybe copy/paste the model file to the EC2 instance. Then open and run it with FlexSim. Since it's a simple model, it should run fine in that environment.
  • Try running the experiment using the EC2 instance.

Try testing your model:

  • Make sure you can run every scenario (at least for a short amount of time) on your local computer.
    • If you use the experimenter, make sure that there are no system console messages in the performance measures.
    • If you run locally, make sure the scenario works.

If the simple model works and yours doesn't, then something specific to your model is probably the cause. Perhaps your model relies on files that aren't present on the EC2 instance? For example, I don't think reading from an Excel file works unless you've installed Excel on the instance.

Here is some basic information about how the process works:

  • User runs an experiment that uses Remote CPUs.
  • FlexSim sends a synchronous HTTP request to the remote host.
  • The host PC launches an instance of FlexSim.
  • That instance of FlexSim launches the specified number of child processes
  • Those child processes attempt to bind a listening TCP socket, starting with port 9000, and consuming one socket per child process
    • If they don't have permission to bind TCP listening sockets, then this might fail
    • If some other program has already bound itself to port 9000, this might fail.
  • The FlexSim process that launched the children pings each child process to detect that it is ready. Once all are ready, the main process writes a file called "ports.txt"
    • If your instance doesn't have a hard drive/storage for files, then this will fail
  • Once the ports.txt file is present, the webserver responds to the HTTP request to spawn child instances, telling FlexSim which ports to connect to.
  • The main/local FlexSim then connects to each of the remote FlexSims with a TCP socket.
  • The main/local FlexSim saves the entire main tree into memory.
  • The main/local sends a copy of the main tree to the child processes, which attempt to load the main tree.
    • If your model uses modules that aren't installed on the remote instance, this won't work.
  • After that, the experimenter submits the tasks. They are marked in the database as submitted.
  • The child processes are supposed to work as a team to do tasks as they come up.
    • For each task, the child process sets the parameters, resets the model, and then runs the model.
    • If something goes wrong during this process (setting parameters, resetting, and running) then the child process might vanish, leaving FlexSim waiting for work to happen that will never happen.
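
To test the port-binding step described above, you could run something like the following on the EC2 instance while no experiment is running. It is a rough sketch in Python that tries to bind a listening socket on each port starting at 9000, the same way the child processes would; the function name `bindable_ports` and its parameters are hypothetical, not anything FlexSim provides.

```python
import socket


def bindable_ports(start=9000, count=1, host="0.0.0.0"):
    """Try to bind a listening TCP socket on each port; map port -> success."""
    results = {}
    for port in range(start, start + count):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind((host, port))
                s.listen(1)
                results[port] = True
            except OSError:
                # Port already in use, or no permission to bind it.
                results[port] = False
    return results
```

With one remote CPU configured, `bindable_ports(9000, 1)` should report `True` for port 9000. A `False` there would point at either a permissions problem or another program already holding the port.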

Jordan Johnson answered Brandon H commented

I'm not sure what this would mean. How many instances are you launching? If you are launching more than 4 instances, I could see FlexSim spending more than 10 minutes starting the experiment.

Have you verified that your model can run in Windows Server 2022? I recently had trouble making a container in that version of Windows, so I wonder if something about it doesn't work.

Other than that, I'm not sure what might be wrong. We'd probably need to look at an exact model to be more sure.

As far as the exceptions you are getting, are you starting new instances between each try, or are you trying multiple times with the same instance? The exceptions look like the instance still has a FlexSim child process running and consuming the port, so new experiments can't use that port.
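
If you want to confirm whether something is still holding the port between tries, one option is to run `netstat -ano` on the instance and look for a listener on port 9000. Here is a small Python sketch that picks the listening process IDs out of that output. The helper name `listening_pids` is my own, and it assumes the usual Windows `netstat -ano` column layout (protocol, local address, foreign address, state, PID).

```python
def listening_pids(netstat_output, port):
    """Return the set of PIDs that have a TCP listener on the given port."""
    pids = set()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected shape: TCP <local addr> <foreign addr> LISTENING <pid>
        if len(fields) == 5 and fields[0] == "TCP" and fields[3] == "LISTENING":
            # The port is whatever follows the last colon (works for [::]:9000 too).
            local_port = fields[1].rsplit(":", 1)[-1]
            if local_port == str(port):
                pids.add(int(fields[4]))
    return pids
```

Any PID it returns can be looked up in Task Manager to see whether it is a leftover FlexSim process from an earlier attempt.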
