
Brandon H asked · Preet commented

Cloud nodes don't progress beyond "submitted" stage

After following this walk-through, along with some information from this seemingly old procedure, I have ended up in a state where the experiments that I set to run on distributed cloud nodes reach the "Submitted" stage but never progress beyond that. I have let them wait in that state for up to 10 minutes with no change. Do you have any ideas of what I could do to debug this?

I am using AWS with an EC2 instance that has 2 vCPUs and 4 GB RAM and is running Windows Server 2022. Due to the low amount of RAM, I entered "1" for the CPUs field in my local FlexSim setup in order to abide by the minimum/recommended requirements called out in Distributed experiments or optimizations.

Additionally, I seem to get this error in the System Console every few tries. Unsure if it is related.

exception: Experimenter[IP ADDRESS] did not return valid ports
exception: ExperimenterUnable to create child processes on cloud nodes
exception: Experimenter Could not get jobID
exception: FlexScript exception: MAIN:/project/library/Experimenter>behaviour/eventfunctions/assertScenario c: MODEL:/Tools/Experimenter
exception: FlexScript exception: MAIN:/project/library/Experimenter>behaviour/eventfunctions/assertTask c: MODEL:/Tools/Experimenter
exception: FlexScript exception: Array index out of bounds at MAIN:/project/library/Experimenter>behaviour/eventfunctions/approveTasks c: MODEL:/Tools/Experimenter
exception: treewin__CallUserCallback
ex: CallUserCallback
FlexSim 23.0.2
Tags: experimenter, aws, distributed experiments

Ben Wilson ♦♦ commented ·

Hi @Brandon H,

The exceptions suggest that your main FlexSim instance is not able to successfully connect to the cloud nodes. Do you have the webserver running there? Are you able to hit those nodes' FlexSim webservers with an internet browser over their IP address and your chosen port number?
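The browser check described above can also be scripted from the local machine. This is only a minimal sketch using Python's standard library: the function name `webserver_reachable` is made up for illustration, and it only confirms that something answers HTTP on the given address and port, not that it is the FlexSim webserver or that it is configured correctly.

```python
import http.client


def webserver_reachable(host, port=80, timeout=5.0):
    """Return True if an HTTP server answers a GET / on host:port.

    Any HTTP response (even 404 or 500) counts as reachable, because the
    goal is only to confirm that the network path and the listener exist.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", "/")
        conn.getresponse()
        return True
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure, or a malformed reply.
        return False
    finally:
        conn.close()
```

For example, `webserver_reachable("203.0.113.10", 80)` would test a node at that address; the IP is a documentation placeholder, so substitute the node's public IP and the port from its webserver configuration.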

Also, the article mentions that your cloud nodes ought to meet FlexSim's recommended requirements - not the minimums. It could be that your nodes don't have enough RAM to run even one replication of your simulation. This depends on your model's exact requirements, of course, but I will note that in 2023 our minimum RAM spec is 8GB. As a test you could run a replication directly on one of your nodes while watching the task manager to make sure that you aren't getting close to your node's hardware limitations. Keep in mind that when running as a cloud node you'll need additional overhead at the end of the model run to store and report back any metrics your model keeps, so you may need several hundred more MB when this node runs your model as a cloud node, depending on what stats you're gathering. See the Experiments and Optimizations heading under Memory for more info about how a replication uses memory.

@Jordan Johnson, do you have any other suggestions or ideas for Brandon?

Brandon H commented ·


I have included some screenshots to hopefully help debug what I am doing wrong, along with some information about the tools I used:
AWS Instance:
Config 1 -> Windows Server 2022, 4 vCPUs, 16 GB RAM, 30 GB Storage
Config 2 -> Windows Server 2016, 4 vCPUs, 16 GB RAM, 30 GB Storage
Experimenter Settings:
Job 1 -> 1 Scenario with 1 Replication

Added Data:
When the Experimenter is run on my laptop, the specific job seems to require a max of 300MB RAM. On the cloud node/instance, I see a spike that looks to be around 300MB RAM when I first send the job but the RAM drops back down to an idle state immediately afterwards. Something similar happens with the CPU usage. I have waited up to 20 minutes for the "Submitted" status to change to "Running".

experimenter-setup.png


Experimenter and cloud node setup

local-computer-to-node-webserver.png

Verification that Webserver can be reached

webserver.png

Remote Desktop view of cloud node/instance (moments after the Job was started on the local computer)

Jeanette F ♦♦ commented ·

Hi @Brandon H, was Jordan Johnson's answer helpful? If so, please click the "Accept" button at the bottom of their answer. Or if you still have questions, add a comment and we'll continue the conversation.

If we haven't heard back from you within 3 business days we'll auto-accept an answer, but you can always unaccept and comment back to reopen your question.

Preet commented ·

Hi @Brandon H. Were you able to solve this problem? I am having a similar issue where a simple model just gets submitted and does not go beyond that.

Jordan Johnson answered

@Brandon H It's not clear to me what's happening. It does seem likely that the connection isn't working correctly. These are the things I have thought of:

  • Be sure you have the latest version of the webserver installed on the EC2 instance.
  • Be sure you have the same version of FlexSim installed on the instance as you are running locally. For example, if you are using 23.0.2 on your local computer, install 23.0.2 on the EC2 instance/image.
  • Be sure that the webserver is running when the instance starts and is listening on port 80.

Try testing a basic model:

  • Use a very simple model for testing, perhaps Source-Queue-Processor-Sink.
  • Verify that you can run that experiment normally/locally on your own computer.
  • Verify that you can run the model on the EC2 instance. I think you can drag/drop or maybe copy/paste the model file to the EC2 instance. Then open and run it with FlexSim. Since it's a simple model, it should run fine in that environment.
  • Try running the experiment using the EC2 instance.

Try testing your model:

  • Make sure you can run every scenario (at least for a short amount of time) on your local computer.
    • If you use the experimenter, make sure that there are no system console messages in the performance measures.
    • If you run locally, make sure the scenario works.

If the simple model works and yours doesn't, then something specific to your model is likely the cause. Perhaps your model relies on files that aren't present on the EC2 instance? For example, I don't think reading from an Excel file works unless you've installed Excel on the instance.

Here is some basic information about how the process works:

  • User runs an experiment that uses Remote CPUs.
  • FlexSim sends a synchronous HTTP request to the remote host.
  • The host PC launches a FlexSim instance.
  • That instance of FlexSim launches the specified number of child processes.
  • Those child processes attempt to bind a listening TCP socket, starting with port 9000, and consuming one socket per child process.
    • If they don't have permission to bind TCP listening sockets, then this might fail.
    • If some other program has already bound itself to port 9000, this might fail.
  • The FlexSim process that launched the children pings each child process to detect that it is ready. Once all are ready, the main process writes a file called "ports.txt".
    • If your instance doesn't have a hard drive/storage for files, then this will fail.
  • Once the ports.txt file is present, the webserver responds to the HTTP request to spawn child instances, telling FlexSim which ports to connect to.
  • The main/local FlexSim then connects to each of the remote FlexSims with a TCP socket.
  • The main/local FlexSim saves the entire main tree into memory.
  • The main/local sends a copy of the main tree to the child processes, which attempt to load the main tree.
    • If your model uses modules that aren't installed on the remote instance, this won't work.
  • After that, the experimenter submits the tasks. They are marked in the database as submitted.
  • The child processes are supposed to work as a team to do tasks as they come up.
    • For each task, the child process sets the parameters, resets the model, and then runs the model.
    • If something goes wrong during this process (setting parameters, resetting, and running) then the child process might vanish, leaving FlexSim waiting for work to happen that will never happen.
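Because the main FlexSim connects to each child process over its own TCP socket, a firewall or security group that allows port 80 but blocks the child ports could plausibly leave tasks stuck at "Submitted". The sketch below, run from the local machine, tries a plain TCP connection to a range of ports on the node. The function name `probe_tcp_ports` is hypothetical, and the default of 9000 is taken from the description above rather than from FlexSim documentation. Note that the child ports are only open while child processes are running, so a closed result before an experiment starts does not by itself indicate a problem.

```python
import socket


def probe_tcp_ports(host, first_port=9000, count=1, timeout=3.0):
    """Try a TCP connection to each port in [first_port, first_port + count).

    Returns a dict mapping port number to True (connection accepted) or
    False (refused, timed out, or otherwise unreachable).
    """
    results = {}
    for port in range(first_port, first_port + count):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Covers refused connections, timeouts, and routing failures.
            results[port] = False
    return results
```

Calling `probe_tcp_ports("203.0.113.10", 9000, 4)` while the experiment sits at "Submitted" would show whether the first four child ports accept connections; the address is a placeholder for the node's IP. A timeout (rather than an immediate refusal) often points at a firewall silently dropping packets somewhere along the path.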

Brandon H commented ·
@Jeanette F Thanks for reminding me to close this out. While Jordan's comment just above did not contain the actual answer to my question, it did guide me down the right debugging paths to learn that it was something to do with my company's network protocols. I have since resolved the issue.


However, since his comment is a "comment" and not an "answer" I cannot seem to mark this question "answered". The only "answer" is the other response he gave me early on in this question's history.

Please let me know how I should proceed or, if you would, close out this question in whatever way you deem fit based on what I have relayed back here.

Jordan Johnson answered · Brandon H commented

I'm not sure what this would mean. How many instances are you launching? If you are launching more than 4 instances, I could see FlexSim spending more than 10 minutes starting the experiment.

Have you verified that your model can run in Windows Server 2022? I recently had trouble making a container in that version of Windows, so I wonder if something about it doesn't work.

Other than that, I'm not sure what might be wrong. We'd probably need to look at an exact model to be more sure.

As far as the exceptions you are getting, are you starting new instances between each try, or are you trying multiple times with the same instance? The exceptions look like the instance still has a FlexSim child process running and consuming the port, so new experiments can't use that port.
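One way to test the "leftover child process is still holding the port" theory is to run a small check on the node itself before starting a new experiment. This is a rough sketch, not a FlexSim tool: the function name `ports_in_use` is invented for illustration, and the starting port of 9000 is an assumption about where the child processes listen. It tries to bind each port briefly and reports the ones that are already taken.

```python
import socket


def ports_in_use(first_port=9000, count=4, host="0.0.0.0"):
    """Return the ports in [first_port, first_port + count) that cannot be bound.

    A port that cannot be bound is most likely held by another process,
    for example a child process left over from an earlier experiment.
    """
    busy = []
    for port in range(first_port, first_port + count):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
        except OSError:
            # Address already in use, or no permission to bind.
            busy.append(port)
        finally:
            sock.close()
    return busy
```

An empty list from `ports_in_use()` suggests nothing is holding those ports; a non-empty list means it is worth looking in Task Manager for lingering FlexSim processes before retrying.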


Brandon H commented ·

I am unsure what you mean by "instances" here. In terms of the number of scenarios and replications per run, I had initially started this by trying 1 scenario and 5 replications. I have since cut that down to 1 scenario and 1 replication. I found that it required about 300MB of RAM when run on my local computer. I just opted to go for a 4 CPU and 16GB RAM EC2 instance and found the same result of the job status staying stuck at "Submitted" for ~10 minutes before I quit it.



Upon further testing, that exception error would show up if I stopped the test and started again immediately without closing out the FlexSim Program on my local computer. I did not touch anything on the cloud node/instance.

I will now try to use an older Windows Server version to see if that helps.

UPDATE - Switching to Windows Server 2016 did not change the outcome.

Jason Lightfoot ♦♦ commented (in reply to Brandon H) ·

Can you share your webserver Configuration file?

Brandon H commented (in reply to Jason Lightfoot ♦♦) ·
# This file needs to be in the same directory as flexsimserver.bat


General:
    Flexsim Program Directory:      %PROGRAMFILES%\FlexSim 2023\program
    Model Directory:                %DOCUMENTS%\FlexSim 2023 Projects
    Port:                           80
    Reply Timeout (milliseconds):   10000
    Max Instances (of Flexsim):     8
    Max Threads Per Instance:       max
    Ignore Auto Save Files:         yes
Remote Operations (security hazards):
    Model Uploading:                no
    Model Downloading:              no
    Model Deleting:                 no
    Max Upload Size (bytes):        10000000
Jobs:
    Flexsim Data Directory:         %AllUsersProfile%\FlexSim\FlexSim23.0
    Max Job Queue Length:           100
    Max Job Timeout (seconds):      3600
Windows Authentication:
    Use Windows Authentication:     no
    Restrict UserGroup Directories: no
    Active Directory:
        url:                        ldap://dc.domainName.com
        baseDN:                     dc=domainName,dc=com
        username:                   [email protected]
        password:                   password
Session:
    Enable:                         no
    Secret:                         flexsim secret
    Max Age (seconds):              3600