Brandon H asked

Cloud nodes don't progress beyond "submitted" stage

After following this walk-through, along with some information from this seemingly old procedure, I have ended up in a state where the experiments that I set to run on distributed cloud nodes reach the "Submitted" stage but never progress beyond it. I have let them wait in that state for up to 10 minutes with no change. Do you have any ideas of what I could do to debug this?

I am using AWS with an EC2 instance that has 2vCPUs, 4GB RAM, and is running Windows Server 2022. Due to the low RAM amount, I entered "1" for the CPUs field in my local FlexSim setup in order to abide by the minimum/recommended requirements called out in Distributed experiments or optimizations.

Additionally, I seem to get this error in the System Console every few tries. Unsure if it is related.

exception: Experimenter[IP ADDRESS] did not return valid ports
exception: ExperimenterUnable to create child processes on cloud nodes
exception: Experimenter Could not get jobID
exception: FlexScript exception: MAIN:/project/library/Experimenter>behaviour/eventfunctions/assertScenario c: MODEL:/Tools/Experimenter
exception: FlexScript exception: MAIN:/project/library/Experimenter>behaviour/eventfunctions/assertTask c: MODEL:/Tools/Experimenter
exception: FlexScript exception: Array index out of bounds at MAIN:/project/library/Experimenter>behaviour/eventfunctions/approveTasks c: MODEL:/Tools/Experimenter
exception: treewin__CallUserCallback
ex: CallUserCallback
FlexSim 23.0.2

Ben Wilson ♦♦ commented ·

Hi @Brandon H,

The exceptions suggest that your main FlexSim instance is not able to successfully connect to the cloud nodes. Do you have the webserver running there? Are you able to hit those nodes' FlexSim webservers with an internet browser over their IP address and your chosen port number?
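
The browser check described above can also be scripted from the local computer. The following is a minimal Python sketch, not part of FlexSim: the function name `webserver_reachable` and its defaults are made up for illustration. It treats any HTTP response, including an error status, as proof that something is answering on that address and port, and it deliberately ignores proxy environment variables so that it tests the direct path.

```python
import http.client
import urllib.error
import urllib.request


def webserver_reachable(host: str, port: int = 80, timeout: float = 5.0) -> bool:
    """Return True if something answers HTTP at http://host:port/."""
    # Ignore proxy environment variables so the check exercises the direct path.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(f"http://{host}:{port}/", timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a 2xx status.
        return True
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Refused, timed out, blocked on the way, or not speaking HTTP.
        return False
```

Calling `webserver_reachable("<node IP>", 80)` with the node's public address and the port from the webserver configuration should return `True` if the webserver can be reached. A `False` result points at the webserver not running or at a firewall or security-group rule in between.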

Also, the article mentions that your cloud nodes ought to meet FlexSim's recommended requirements - not the minimums. It could be that your nodes don't have enough RAM to run even one replication of your simulation. This depends on your model's exact requirements, of course, but I will note that in 2023 our minimum RAM spec is 8GB. As a test you could run a replication directly on one of your nodes while watching the task manager to make sure that you aren't getting close to your node's hardware limitations. Keep in mind that when running as a cloud node you'll need additional overhead at the end of the model run to store and report back any metrics your model keeps, so you may need several hundred more MB when this node runs your model as a cloud node, depending on what stats you're gathering. See the Experiments and Optimizations heading under Memory for more info about how a replication uses memory.

@Jordan Johnson, do you have any other suggestions or ideas for Brandon?

Brandon H, replying to Ben Wilson ♦♦, commented ·
@Ben Wilson Firstly, thank you for redirecting this as its own question. I will make sure to keep this in mind for future items.


The only checks that I had performed to determine if I was getting a proper connection were opening the webserver in a browser on the node itself and using the "Test Connections" button in the Global Preferences/Environment/Cloud Computing section of FlexSim on my local computer. The fact that the job replications hit the "Submitted" state but did not do anything leads me to believe that the connection was made but perhaps the resources available were not adequate. I was not aware that the new minimum RAM spec was 8GB. I will have to try getting a more suitable instance next time I try this and follow your instructions on monitoring the status of the node.

I will report back any findings after I conduct this test.

Brandon H commented ·


I have included some screenshots to hopefully help debug what I am doing wrong, along with some information about the tools I used:
AWS Instance:
Config 1 -> Microsoft Server 2022, 4vCPUs, 16GB RAM, 30GB Storage
Config 2 -> Microsoft Server 2016, 4vCPUs, 16GB RAM, 30GB Storage
Experimenter Settings:
Job 1 -> 1 Scenario with 1 Replication

Added Data:
When the Experimenter is run on my laptop, the specific job seems to require a max of 300MB RAM. On the cloud node/instance, I see a spike that looks to be around 300MB RAM when I first send the job but the RAM drops back down to an idle state immediately afterwards. Something similar happens with the CPU usage. I have waited up to 20 minutes for the "Submitted" status to change to "Running".

experimenter-setup.png


Experimenter and cloud node setup

local-computer-to-node-webserver.png

Verification that Webserver can be reached

webserver.png

Remote Desktop view of cloud node/instance (moments after the Job was started on the local computer)

Jeanette F ♦♦ commented ·

Hi @Brandon H, was Jordan Johnson's answer helpful? If so, please click the "Accept" button at the bottom of their answer. Or if you still have questions, add a comment and we'll continue the conversation.

If we haven't heard back from you within 3 business days we'll auto-accept an answer, but you can always unaccept and comment back to reopen your question.

Preet commented ·

Hi @Brandon H. Were you able to solve this problem? I am having a similar issue where a simple model just gets submitted and does not go beyond that.

Brandon H, replying to Preet, commented ·
@Preet As I said in my latest reply, it was an issue with the network rules set by my company's IT department due to me being a remote employee. I found this out by testing the walkthrough on a spare computer, where I had immediate success running a full experiment. This indicated to me that the walkthrough was indeed correct and that there must be something unique about my work computer.


Based on my experience, I recommend looking into what rules are set on your local computer as well as double-checking your server's rule set to make sure there are no mistakes.

Preet, replying to Brandon H, commented ·
I checked the inbound/outbound rules on EC2 and they were fine. I also checked the firewall on my local PC, and FlexSim is an allowed app. I also checked on the remote desktop, where both nodejs and FlexSim are allowed through the firewall. The only thing I can think of is that I have an enterprise license, which I have to run by connecting over VPN to my employer's network. Not sure what to do next!
Jason Lightfoot ♦♦, replying to Preet, commented ·

You can see from Brandon's reply below that Jordan's answer "...did guide me down the right debugging paths to learn that it was something to do with my company's network protocols. I have since resolved the issue."

Jordan Johnson answered

@Brandon H It's not clear to me what's happening. It does seem likely that the connection isn't working correctly. These are the things I have thought of:

  • Be sure you have the latest version of the webserver installed on the EC2 instance.
  • Be sure you have the same version of FlexSim installed on the instance as you are running locally. For example, if you are using 23.0.2 on your local computer, install 23.0.2 on the EC2 instance/image.
  • Be sure that the webserver is running when the instance starts and is listening on port 80

Try testing a basic model:

  • Use a very simple model for testing, perhaps Source-Queue-Processor-Sink.
  • Verify that you can run that experiment normally/locally on your own computer
  • Verify that you can run the model on the EC2 instance. I think you can drag/drop or maybe copy/paste the model file to the EC2 instance. Then open and run it with FlexSim. Since it's a simple model, it should run fine in that environment.
  • Try running the experiment using the EC2 instance.

Try testing your model:

  • Make sure you can run every scenario (at least for a short amount of time) on your local computer.
    • If you use the experimenter, make sure that there are no system console messages in the performance measures.
    • If you run locally, make sure the scenario works.

If the simple model works and yours doesn't, then something specific to your model is probably the cause. Perhaps your model relies on files that aren't present on the EC2 instance? For example, I don't think reading from an Excel file works unless you've installed Excel on the instance.

Here is some basic information about how the process works:

  • User runs an experiment that uses Remote CPUs.
  • FlexSim sends a synchronous HTTP request to the remote host
  • The host PC launches a FlexSim instance
  • That instance of FlexSim launches the specified number of child processes
  • Those child processes attempt to bind a listening TCP socket, starting with port 9000, and consuming one socket per child process
    • If they don't have permission to bind TCP listening sockets, then this might fail
    • If some other program has already bound itself to port 9000, this might fail.
  • The FlexSim process that launched the children pings each child process to detect that it is ready. Once all are ready, the main process writes a file called "ports.txt"
    • If your instance doesn't have a hard drive/storage for files, then this will fail
  • Once the ports.txt file is present, the webserver responds to the HTTP request to spawn child instances, telling FlexSim which ports to connect to.
  • The main/local FlexSim then connects to each of the remote FlexSims with a TCP socket.
  • The main/local FlexSim saves the entire main tree into memory.
  • The main/local FlexSim sends a copy of the main tree to the child processes, which attempt to load the main tree.
    • If your model uses modules that aren't installed on the remote instance, this won't work.
  • After that, the experimenter submits the tasks. They are marked in the database as submitted.
  • The child processes are supposed to work as a team to do tasks as they come up.
    • For each task, the child process sets the parameters, resets the model, and then runs the model.
    • If something goes wrong during this process (setting parameters, resetting, and running) then the child process might vanish, leaving FlexSim waiting for work to happen that will never happen.
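
The steps above suggest one quick check from the local computer: whether the child-process ports can be reached at all, since the local FlexSim must open a TCP socket to each of them in addition to the HTTP request. Below is a minimal Python sketch of such a probe. It assumes the children take consecutive ports starting at 9000 (the list only says they start at 9000 and use one socket each), and the function name `probe_ports` is hypothetical. The children only listen after the spawn request, so the probe is only meaningful while an experiment is sitting in "Submitted". It may also disturb a child that is waiting for FlexSim's own connection, so treat it as a one-off diagnostic.

```python
import socket


def probe_ports(host: str, first_port: int = 9000, count: int = 1,
                timeout: float = 3.0) -> dict[int, bool]:
    """Try a TCP connection to each port; map port -> whether it connected."""
    results = {}
    # Assumption: ports are consecutive, one per child process.
    for port in range(first_port, first_port + count):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Refused, timed out, or filtered by a firewall or network rule.
            results[port] = False
    return results
```

If port 80 answers but every port in the 9000 range comes back `False` while the children are running on the node, then a firewall, security group, or corporate network rule between the two machines is a likely suspect.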

Brandon H commented ·
@Jeanette F Thanks for reminding me to close this out. While Jordan's comment just above did not contain the actual answer to my question, it did guide me down the right debugging paths to learn that it was something to do with my company's network protocols. I have since resolved the issue.


However, since his comment is a "comment" and not an "answer" I cannot seem to mark this question "answered". The only "answer" is the other response he gave me early on in this question's history.

Please let me know how I should proceed or, if you would, close out this question in whatever way you deem fit based on what I have relayed back here.

Jordan Johnson answered

I'm not sure what this would mean. How many instances are you launching? If you are launching more than 4 instances, I could see FlexSim spending more than 10 minutes starting the experiment.

Have you verified that your model can run in Windows Server 2022? I recently had trouble making a container in that version of Windows, so I wonder if something about it doesn't work.

Other than that, I'm not sure what might be wrong. We'd probably need to look at an exact model to be more sure.

As far as the exceptions you are getting, are you starting new instances between each try, or are you trying multiple times with the same instance? The exceptions look like the instance still has a FlexSim child process running and consuming the port, so new experiments can't use that port.
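
One way to test the leftover-process theory is to check, on the node itself and while no experiment is running, whether a listening socket can still be bound to the port. This is a minimal Python sketch with a hypothetical helper name, `port_is_free`. It assumes Python is available on the instance and takes port 9000 as the first child port, as described in the other answer.

```python
import socket


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a listening TCP socket can be bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
            sock.listen(1)
            return True
        except OSError:
            # Typically "address already in use": another process holds the port.
            return False
```

If `port_is_free(9000)` returns `False` on an idle node, something is still holding the port, which would be consistent with a stale child process. The socket is closed again as soon as the function returns, so the check does not keep the port occupied.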


Brandon H commented ·

I am unsure what you mean by "instances" here. In terms of the number of scenarios and replications per run, I had initially started this by trying 1 scenario and 5 replications. I have since cut that down to 1 scenario and 1 replication. I found that it required about 300MB of RAM when run on my local computer. I just opted to go for a 4 CPU and 16GB RAM EC2 instance and found the same result of the job status staying stuck at "Submitted" for ~10 minutes before I quit it.



Upon further testing, that exception error would show up if I stopped the test and started again immediately without closing out the FlexSim Program on my local computer. I did not touch anything on the cloud node/instance.

I will now try to use an older Windows Server version to see if that helps.

UPDATE - Switching to Microsoft Server 2016 did not change the outcome

Jason Lightfoot ♦♦, replying to Brandon H, commented ·

Can you share your webserver Configuration file?

Brandon H, replying to Jason Lightfoot ♦♦, commented ·
# This file needs to be in the same directory as flexsimserver.bat


General:
    Flexsim Program Directory:      %PROGRAMFILES%\FlexSim 2023\program
    Model Directory:                %DOCUMENTS%\FlexSim 2023 Projects
    Port:                           80
    Reply Timeout (milliseconds):   10000
    Max Instances (of Flexsim):     8
    Max Threads Per Instance:       max
    Ignore Auto Save Files:         yes
Remote Operations (security hazards):
    Model Uploading:                no
    Model Downloading:              no
    Model Deleting:                 no
    Max Upload Size (bytes):        10000000
Jobs:
    Flexsim Data Directory:         %AllUsersProfile%\FlexSim\FlexSim23.0
    Max Job Queue Length:           100
    Max Job Timeout (seconds):      3600
Windows Authentication:
    Use Windows Authentication:     no
    Restrict UserGroup Directories: no
    Active Directory:
        url:                        ldap://dc.domainName.com
        baseDN:                     dc=domainName,dc=com
        username:                   [email protected]
        password:                   password
Session:
    Enable:                         no
    Secret:                         flexsim secret
    Max Age (seconds):              3600