Jets 3.0.17 extreme performance issue

After upgrading from Jets 3.0.13 to 3.0.17, the durations of all our Lambdas increased significantly … after rolling back to Jets 3.0.13, the durations dropped back down. Same code, different performance.

3.0.17 increased the Lambda cost for our system 8-fold!

Bummer. Unable to reproduce :face_with_monocle:

Deployed 2 demo Jets apps:

Not seeing a significant difference:

~/environment/demo13 $ jets deploy
~/environment/demo13 $ jets url
API Gateway Endpoint: https://2zb5v74463.execute-api.us-west-2.amazonaws.com/dev
~/environment/demo13 $ time curl https://2zb5v74463.execute-api.us-west-2.amazonaws.com/dev
{"jets_version":"3.0.13","ruby_version":"2.7.4"}
real    0m0.097s
user    0m0.006s
sys     0m0.007s
~/environment/demo13 $ time curl https://2zb5v74463.execute-api.us-west-2.amazonaws.com/dev
{"jets_version":"3.0.13","ruby_version":"2.7.4"}
real    0m0.081s
user    0m0.009s
sys     0m0.005s
~/environment/demo13 $ cd ../demo17
~/environment/demo17 $ jets deploy
~/environment/demo17 $ jets url
API Gateway Endpoint: https://jnhxcpxcrk.execute-api.us-west-2.amazonaws.com/dev
~/environment/demo17 $ time curl https://jnhxcpxcrk.execute-api.us-west-2.amazonaws.com/dev
{"jets_version":"3.0.17","ruby_version":"2.7.4"}
real    0m0.077s
user    0m0.007s
sys     0m0.006s
~/environment/demo17 $ time curl https://jnhxcpxcrk.execute-api.us-west-2.amazonaws.com/dev
{"jets_version":"3.0.17","ruby_version":"2.7.4"}
real    0m0.091s
user    0m0.004s
sys     0m0.009s
~/environment/demo17 $ 

Wondering what’s different here. Wondering if you can try deploying both of the demo apps above to your AWS account, to see whether we can rule out an environment-specific or app-specific issue.

@tung

I disabled prewarm and the increase in average duration went away. Perhaps there is some problem with the prewarm feature? Does the prewarm job use the AWS SDK in a way that might affect the prewarm execution?
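For reference, that was a one-line change in our application.rb, using the same flag Jets exposes:

Jets.application.configure do
  config.prewarm.enable = false   # default is true
end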

If it matters, we are using VPC.

The 2 example apps above have prewarming enabled and it didn’t make a difference.

VPC seems like the culprit. VPC used to add extremely long cold-start times. AWS improved this by moving network interface (ENI) provisioning from Lambda function invocation time to function creation time.

Wondering if AWS was tweaking something at the time. Maybe they’re still rolling out the change in your specific region, and you hit some infrastructure running the older Lambda VPC software.

@tung

The VPC changes you mention have been fully rolled out in our region for about two years now.

I’ve turned off pre-warming but not VPC; the Lambdas are still running in the VPC. I would expect the issue to persist if VPC were the culprit.

It’s interesting that turning pre-warming off had a significant impact on average duration. We have 150 Lambdas and quite a lot of volume as well. I think the duration issue may only occur when handling “_prewarm” events.
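One way I could verify that is to tag prewarm invocations in the logs. A rough sketch, assuming the prewarm event payload carries a “_prewarm” key as the Jets docs describe (module name is made up):

# Hypothetical patch to time prewarm vs. real invocations separately.
# Wraps Jets::Controller::Base.process (signature taken from the Jets source).
module PrewarmTiming
  def process(event, context = {}, meth)
    prewarm = event.is_a?(Hash) && event["_prewarm"]
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    super
  ensure
    ms = ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000).round
    puts "#{prewarm ? '[_prewarm]' : '[request]'} #{meth} took #{ms}ms"
  end
end

Jets::Controller::Base.singleton_class.prepend(PrewarmTiming)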

Here is how our application.rb file is set up. Setting the rate to 10 minutes may be too aggressive, but it was working well prior to the upgrade to 3.0.17.

Jets.application.configure do
  
  config.project_name = "xxx"

  config.mode = "api"
  config.api.endpoint_type = 'REGIONAL'

  config.function.memory_size = 3072

  config.function.vpc_config = {
    security_group_ids: [ 'redacted' ],
    subnet_ids: [ 'redacted', 'redacted' ]
  }

  config.prewarm.enable = true         # default is true
  config.prewarm.rate = '10 minutes'   # default is '30 minutes'
  config.prewarm.concurrency = 5       # default is 2

end

Sure. Sounds like turning off prewarming somehow reduced average duration for this specific app. Unsure why :face_with_monocle: Looking at the code diff:

Wondering if the rate-limit changes are the reason for the slowness. In a simple demo app the slowness wouldn’t surface.

Taking a shot in the dark here. Believe the pre-warming loops through the Lambda functions, and that triggers AWS API calls, which could be hitting rate limits. So the listing of Lambda functions will probably have to be cached. Will have to dig into this theory, but it would explain it!
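A minimal sketch of the caching idea, assuming the prewarmer discovers functions via the Lambda ListFunctions API (the class name and TTL here are made up for illustration):

require 'aws-sdk-lambda'

# Hypothetical cache so each prewarm run doesn't re-list every function,
# which is where the ListFunctions rate-limit pressure would come from.
class PrewarmFunctionCache
  TTL_SECONDS = 300 # illustrative value

  def initialize(client: Aws::Lambda::Client.new)
    @client = client
  end

  def function_names
    now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    if @names.nil? || now - @fetched_at > TTL_SECONDS
      @names = fetch_function_names
      @fetched_at = now
    end
    @names
  end

  private

  def fetch_function_names
    names = []
    # list_functions is paginated; the SDK response enumerates pages.
    @client.list_functions.each do |page|
      names.concat(page.functions.map(&:function_name))
    end
    names
  end
end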

Looks like the retry logic can also be hooked into with a Ruby lambda block: https://docs.aws.amazon.com/sdk-for-ruby/v3/developer-guide/timeout-duration.html

Maybe that can be used to see if this shot in the dark is what’s happening. Unsure when I’ll get to this. Will consider PRs. Of course, no sweat either way.
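For example, in the SDK’s legacy retry mode, the retry_backoff option takes a proc that is called with the request context on each retry, so throttled calls could be logged (untested sketch):

require 'aws-sdk-lambda'

# Sketch: make SDK-level retries visible in the logs.
# retry_backoff is only honored in the SDK's legacy retry mode.
client = Aws::Lambda::Client.new(
  retry_mode: "legacy",
  retry_backoff: lambda do |context|
    # context.retries counts the attempts made so far for this request.
    puts "SDK retry ##{context.retries + 1} for #{context.operation_name}"
    Kernel.sleep(2**context.retries * 0.3) # approximate the default backoff
  end
)
# Alternatively, retry_limit: 0 disables SDK retries entirely.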

Also looking for other ideas and clues.

I did try the sample apps you provided. Didn’t see any differences between demo13 & demo17 with or without prewarming enabled.

We are not using the Jets AWS clients. We create our own for use within the application code.

I think you are on to something wrt the prewarmer making AWS API calls & the rate-limit changes.

By default, async Lambda invocations retry on failure. We might not want that behavior in the prewarmer.
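If it helps, those async retries can be turned off per function via the event invoke config; a quick sketch with the Ruby SDK (the function name is a placeholder):

require 'aws-sdk-lambda'

# Turn off the two automatic async-invoke retries so a failing prewarm event
# fails fast instead of piling on more load.
client = Aws::Lambda::Client.new
client.put_function_event_invoke_config(
  function_name: "jets-preheat_job-warm", # placeholder name
  maximum_retry_attempts: 0
)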

It’s worth mentioning our app has 175 Lambdas (just recounted) … and we have two deployments (staging, production) with prewarming enabled.

So that’s 350 Lambdas in the account being prewarmed. Perhaps we are hitting rate limits on the SDK and the Lambdas are also retrying.
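Back-of-the-envelope, if prewarm concurrency means each function is warmed that many times: 350 functions × concurrency 5 = 1,750 async Invoke calls per run, and at a 10-minute rate that’s a sustained ~3 Invoke calls per second across the account, before counting any retries.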

I checked and we are getting a lot of errors on the jets-preheat_job-warm Lambda. More errors than invocations (looks like retries are happening). Fewer errors on the hcr-prod-jets-preheat_job-torch Lambda. Sadly, retries will just make the problem worse in this case.

So if a warm job fails after the SDK’s back-off strategy is exhausted, the Lambda will be retried 2 more times before giving up, adding even more load on the Lambda API.

@tung
Appreciate your help diagnosing this issue!

RE: I checked and we are getting a lot of errors on the jets-preheat_job-warm lambda.

Wondering what the error is? Interested in the stack trace.

My guess is SDK rate limit errors. Nothing is logged for these errors.

API Gateway invokes Lambdas synchronously, so errors get logged.

Preheat (EventBridge) invokes Lambdas asynchronously. After an async Lambda fails all of its retry attempts, Lambda discards the event without logging anything.
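One way to surface those discarded events is an on-failure destination, which makes Lambda deliver the failed event plus the error info to a target instead of dropping it; a sketch with placeholder names:

require 'aws-sdk-lambda'

# Route failed async invocations to an SQS queue instead of letting Lambda
# drop them silently. The function name and queue ARN are placeholders.
client = Aws::Lambda::Client.new
client.put_function_event_invoke_config(
  function_name: "jets-preheat_job-warm",
  destination_config: {
    on_failure: { destination: "arn:aws:sqs:us-west-2:123456789012:prewarm-failures" }
  }
)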

The Jets-generated handler function does not log exceptions either. Something like this would capture the error:

Jets::Controller

# Overrides Base.process
def self.process(event, context = {}, meth)
  controller = new(event, context, meth)
  # Using send because process! is a private method in Jets::RackController, so
  # it doesn't create a lambda function. It doesn't matter what scope process!
  # is in Controller::Base, because Jets lambda function inheritance doesn't
  # include methods from Controller::Base.
  # TODO: Can process! be a protected method to avoid this?
  controller.send(:process!)
rescue => e
  # Log the error details before re-raising so the failure leaves a trace
  # in CloudWatch Logs even for async invocations.
  puts "Error processing #{meth}: #{e.class}: #{e.message}"
  puts e.backtrace.join("\n")
  raise e
end

Bummer that there are no logs. Try running a fork of Jets with your change when you get the chance. Maybe that’ll provide more useful info. No sweat either way, of course.