
Saves up to 90% of AWS EC2 costs by automating the use of spot instances on existing AutoScaling groups. Installs in minutes using CloudFormation or Terraform. Convenient to deploy at scale using StackSets. Uses tagging to avoid launch configuration changes. Automated spot termination handling. Reliable fallback to on-demand instances.

Latest commit

* Implement event-based instance replacement

Context:

This change attempts to replace on-demand instances with spot as soon as they
are launched (triggered by the on-demand instance entering the pending state),
in order to avoid running expensive on-demand capacity in parallel with the
newly launched spot instances. This should cover new groups out of the box, as
well as scaling out or replacement of terminated spot instances.

It has a few nice side effects concerning cost, since on-demand instances are
likely to run for just a few seconds, as well as potentially tackling some
known issues with ElasticBeanstalk(#343, #344) and deletion of CloudFormation
stacks(#332).

It may also trick the ASG into running the start lifecycle hooks(#296), and
will immediately replace any scaled out instances, partially addressing #284.

The current/legacy replacement logic will still kick in when enabling
AutoSpotting for the first time on an existing group, as well as when spot
instances fail to be launched, so the group must fail over to on-demand capacity
in order to avoid downtime.

Once the legacy replacement logic is done on the current group, the group is
tagged with a configuration tag that enables the event logic. This tag is also
set for all groups newer than an hour, so their instances are automatically
replaced by the event-based replacement logic.

Current status:
- changed the regional CloudFormation stack to intercept EC2 instances
  as soon as they enter the 'pending' and 'running' states
- implemented handling code for these events, which launches a replacement
  spot instance for each on-demand instance currently in the pending state and
  performs the usual instance-swapping logic when a spot instance enters the
  running state, without waiting for it to exit its grace period.
- largely kept the initial instance replacement logic as it is (waiting for
  the grace period) to avoid introducing regressions, and also in case the
  events are not properly propagated, as can happen with misconfigurations.
- it compiles and was executed against real infrastructure, and is working as
  expected for the couple of groups I tested it against.
- still missing test coverage for the new code, and some existing tests were
  crashing and had to be disabled for now
- implemented support for handling the AutoScaling lifecycle hooks when
  replacing new on-demand instances in event-based execution mode
- added support for running it locally against custom JSON event data in
  CloudWatch Event format: https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/EventTypes.html#ec2_event_type
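A minimal sketch of such a local test payload and how it could be parsed; the field names follow the documented "EC2 Instance State-change Notification" event shape, while the struct itself is illustrative:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ec2StateChangeEvent models just the fields this sketch needs from the
// "EC2 Instance State-change Notification" CloudWatch event.
type ec2StateChangeEvent struct {
	DetailType string `json:"detail-type"`
	Region     string `json:"region"`
	Detail     struct {
		InstanceID string `json:"instance-id"`
		State      string `json:"state"`
	} `json:"detail"`
}

func main() {
	payload := []byte(`{
		"detail-type": "EC2 Instance State-change Notification",
		"source": "aws.ec2",
		"region": "us-east-1",
		"detail": {"instance-id": "i-0123456789abcdef0", "state": "running"}
	}`)

	var ev ec2StateChangeEvent
	if err := json.Unmarshal(payload, &ev); err != nil {
		panic(err)
	}
	fmt.Printf("instance %s entered state %q in %s\n",
		ev.Detail.InstanceID, ev.Detail.State, ev.Region)
}
```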

ToDo:
- test against more infrastructure
- iron out any crashes
- write automated tests for the new code and re-enable older test cases so
  that coverage doesn't drop
- port regional CloudFormation stack changes to the main template so
  it can also run from a StackSet
- update the documentation to mention this new behavior

* Further improvements

Only handle the "running" events of new instances because the instance
information wasn't filled while the instance was in pending state.

Also run the spot attachment in parallel with the detachment of the on-demand
instance, in order to avoid race conditions with the group's process that
enforces the desired capacity, such as when handling multiple instances in
parallel for new groups or multiple simultaneous instance launches.
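A hedged sketch of this parallel attach/detach, where attachSpotInstance and detachOnDemandInstance are hypothetical stand-ins for the real methods:

```go
package main

import (
	"log"
	"sync"
)

// Hypothetical stand-ins for the real attach/detach methods.
func attachSpotInstance(id, asgName string) error     { log.Println("attach", id); return nil }
func detachOnDemandInstance(id, asgName string) error { log.Println("detach", id); return nil }

// swap runs the attachment and detachment concurrently, so the group's
// desired-capacity enforcement sees the membership change as one step.
func swap(spotID, onDemandID, asgName string) error {
	var wg sync.WaitGroup
	var attachErr, detachErr error

	wg.Add(2)
	go func() { defer wg.Done(); attachErr = attachSpotInstance(spotID, asgName) }()
	go func() { defer wg.Done(); detachErr = detachOnDemandInstance(onDemandID, asgName) }()
	wg.Wait()

	if attachErr != nil {
		return attachErr
	}
	return detachErr
}

func main() { _ = swap("i-spot", "i-ondemand", "my-asg") }
```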

* More improvements

- Suspend termination processes on ASG while swapping instances
- Handle launch lifecycle hooks of unattached spot instances by
  intercepting the API call failures from CloudTrail

* Fix CloudWatch event rule

* Allow flavor to be customised (#359)

* Fix typo (#361)

* Increase automated testing coverage

- Refactor the cron execution mode to support determining the action and
running the action as independent steps, which makes the code easier to
test (see the sketch after this list).
- Switch to Go 1.13
- Suspend the AutoScaling termination process for 5 minutes
- Reset the hourlySavings value between tests
- Improve mocking
- Print log statements when running tests in verbose mode with the
AUTOSPOTTING_DEBUG variable set to "true"
- Change isProtectedFromTermination to return true in case of API
failures.
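A sketch of this decide/run split, with illustrative action names; keeping the decision pure means it can be unit-tested without any AWS mocks:

```go
package main

import "fmt"

type action int

const (
	noAction action = iota
	launchSpotReplacement
	terminateUnneededSpot
)

// decideAction only inspects the group's state and returns what should
// happen; it performs no API calls, so it is trivially unit-testable.
func decideAction(needsReplacement, hasExcessSpot bool) action {
	switch {
	case needsReplacement:
		return launchSpotReplacement
	case hasExcessSpot:
		return terminateUnneededSpot
	default:
		return noAction
	}
}

func main() {
	// The side-effecting "run" step would switch on the returned action.
	fmt.Println(decideAction(true, false) == launchSpotReplacement)
}
```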

* Suspend termination processes for 5 minutes

* Feat/event based instance replacement (#371)

* Allow flavor to be customised (#359)

* fix typo (#360)

* Fix typo (#361)

* UserData wrappers for CloudFormation helper scripts when using Beanstalk (#366)

* Support custom role for cfn-init in Beanstalk UserData

* Wrappers & refactoring

* Docs

* Docs fixes

* More docs fixes

* Docs fixes

* Yet more docs fixes

* Lambda & Kubernetes config

* AutoSpottingElasticBeanstalk managed policy + Rename beanstalk_cfn_init_role to beanstalk_cfn_wrappers

* Update param description

* Kubernetes config fix

* Rename Beanstalk variable + Move test data

* Add missing permission for AutoSpottingBeanstalk role

* Bring Go to 1.13 (#367)

* merge fixes

* Begin - port regional CloudFormation stack changes to the main template so it can also run from a StackSet

* Merge fix

* Fix to template

* Multiple fixes to handle lambda concurrency

As instances can be launched concurrently or with overlapping timing, we have
problems related to multiple Lambdas acting on the same ASG.

* Removed commented code

* Begin working on:

#354 (comment)

* Progress #354

Created Queue in CF template
Begin sending message

* Progress #354

* Progress #354

* Progress #354

* Progress #354

* Progress #354

* * Use inline python lambda to increase ASG

Use it only if the AttachInstances method fails because of a wrong ASG max size.

* * Use inline python lambda to increase ASG

progress

* * Use inline python lambda to increase ASG

progress

* * Improvements on event based instance replacement

- Ported to StackSet deploy mode.

- Fix to "ScalingActivityInProgress" if launched ondemand instance is
terminated before going "inservice":

	Created method "waitForInstanceStatus", that wait, until a max
	retry (5), that an instance belonging to an ASG is in the desired
	status (InService).
	(Sleep time is 5*retry)

- Fix to multiple problems caused by lambda concurrency changing ASG
MaxSize:
	Created another Lambda (LambdaManageASG) in the same region of
	the main one; code python3.7, inline in template.
	Lambda concurrency is set to one.
	Is function is to change ASG MaxSize by the ammount specified.

	Used a "try-catch approach":
	if AttachInstances return error code "ValidationError" and
	string "update the AutoScalingGroup sizes" is present in error
	message, means that we need to increase ASG MaxSize.
	So we execute method changeAutoScalingMaxSize that invoke
	LambdaManageASG.

	If invoke return error "ErrCodeTooManyRequestsException", means
	that multiple Main lambdas are executing LambdaManageASG, we
	sleep for a random interval part of seconds and retry.

	Method attachSpotInstance now return an int that represent the
	ammount of change to the ASG MaxSize, so that in
	swapWithGroupMember we can defer the call to
	changeAutoScalingMaxSize to decrease the ASG MaxSize of the
	previously increased ammount.

	We use "waitForInstanceStatus" in attachSpotInstance before
	returning to be sure that spot instance has been attached and
	are InService before beginning to terminate ondemand one.
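A minimal Go sketch of this try-catch approach; attach and changeAutoScalingMaxSize are hypothetical stand-ins, while the awserr error-inspection calls are the real aws-sdk-go API:

```go
package asg

import (
	"strings"

	"github.com/aws/aws-sdk-go/aws/awserr"
)

// changeAutoScalingMaxSize is a hypothetical stand-in for the call that
// invokes the single-concurrency LambdaManageASG to grow the group.
func changeAutoScalingMaxSize(asgName string, delta int64) error { return nil }

// attachWithRetry calls attach (a wrapper around AttachInstances) and, when
// the failure indicates the MaxSize is too small, grows the group through
// LambdaManageASG and retries the attachment once.
func attachWithRetry(attach func() error, asgName string) error {
	err := attach()
	if aerr, ok := err.(awserr.Error); ok &&
		aerr.Code() == "ValidationError" &&
		strings.Contains(aerr.Message(), "update the AutoScalingGroup sizes") {
		if err := changeAutoScalingMaxSize(asgName, 1); err != nil {
			return err
		}
		return attach()
	}
	return err
}
```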

* * Improvements on event based instance replacement

- Add a logPrefix to better identify Lambda actions in case of concurrent
executions:
	TE = spot termination event
	ST = instance start event
	SC = schedule event

TE and ST are followed by the instance ID; SC is followed by the activation time.
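A minimal sketch of this prefix scheme; the event-type strings are illustrative, not the real internal names:

```go
package main

import (
	"fmt"
	"log"
)

// logPrefix builds the marker described above.
func logPrefix(eventType, id string) string {
	switch eventType {
	case "spot-termination":
		return fmt.Sprintf("TE %s ", id) // id: terminated instance ID
	case "instance-start":
		return fmt.Sprintf("ST %s ", id) // id: started instance ID
	default:
		return fmt.Sprintf("SC %s ", id) // id: schedule activation time
	}
}

func main() {
	log.SetPrefix(logPrefix("instance-start", "i-0123456789abcdef0"))
	log.Println("swapping with a spot instance")
}
```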

* * Improvements on event based instance replacement

- fix/improvement to suspending/resuming the Terminate process:
suspend/resume is now handled by LambdaManageASG too.

When it successfully suspends termination, it adds a tag
"autospotting_suspend_process_by" whose value is the instanceId of the
event that triggered the main Lambda.
When it tries to resume termination, it first checks whether the value of the
tag above equals the current one.
If not, it means that another main Lambda, different from the one that
suspended it, is trying to resume the process; in that case it does not
resume it.

Given that LambdaManageASG has a concurrent execution limit of 1, we can
have the following cases:

1)
  *) A main Lambda suspends the process.
  *) No other Lambda suspends it.
  *) The main Lambda resumes the process.
  *) Another main Lambda suspends the process.
  *) ...and so on.

2)
  *) A main Lambda suspends the process.
  *) Before the first one resumes the process, another one suspends it
  (thus replacing the tag value).
  *) The first Lambda does not resume the process (the tag values differ).
  *) The second Lambda resumes the process.

In the worst-case scenario of a Lambda dying before resuming the process, it
will be resumed after another one suspends it.
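A hedged sketch of this tag handshake; readASGTag and resumeTerminationProcess are hypothetical stand-ins for the real calls:

```go
package asg

const suspendedByTag = "autospotting_suspend_process_by"

// Hypothetical stand-ins for the real tag lookup and AutoScaling call.
func readASGTag(asgName, key string) string         { return "" }
func resumeTerminationProcess(asgName string) error { return nil }

// resumeIfOwner resumes the Terminate process only when this invocation is
// the one that suspended it, identified by the triggering instance ID.
func resumeIfOwner(asgName, myInstanceID string) error {
	if readASGTag(asgName, suspendedByTag) != myInstanceID {
		// Another concurrent Lambda replaced the tag; it now owns the
		// suspension and is responsible for resuming it.
		return nil
	}
	return resumeTerminationProcess(asgName)
}
```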

* * Improvements on event based instance replacement

cosmetic code fix

* * Improvements on event based instance replacement

- for the rand seed, instead of using time.Now().UnixNano() we build a seed
based on the instanceId that triggered the event.

The seed is built this way:
for every char of the instanceId (starting from the third char) we take its
rune value and add it to the previous sum.
We use that as a temporary seed and draw a random number between 0 and 9.

The final seed is the concatenation of the generated random digits.

This way we have a seed number (int64) that depends on the instanceId and
has the same length.
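A sketch of this seed derivation, under the assumption that the running rune sum seeds a throwaway source per character and the drawn digits are concatenated:

```go
package main

import (
	"fmt"
	"math/rand"
)

// seedFromInstanceID derives a deterministic int64 seed from an instance ID:
// the running sum of rune values (from the third character on) seeds a
// throwaway source per character, and the drawn 0-9 digits are concatenated.
func seedFromInstanceID(id string) int64 {
	var sum, seed int64
	for _, r := range id[2:] { // skip the "i-" prefix
		sum += int64(r)
		digit := rand.New(rand.NewSource(sum)).Intn(10) // 0..9 inclusive
		seed = seed*10 + int64(digit)
	}
	return seed
}

func main() {
	fmt.Println(seedFromInstanceID("i-0123456789abcdef0"))
}
```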

* * Improvements on event based instance replacement

- no more need to "revert attach/detach order when running on minimum
capacity".
Defer changeAutoScalingMaxSize and use same logic of
swapWithGroupMember.

* * Improvements on event based instance replacement

- Use suspendResumeProcess for the scheduled replaceOnDemandInstanceWithSpot
too.

We use it as a trick to avoid rare cases of concurrency between
scheduled and event Lambdas.
As the Lambda that handles "suspendResumeProcess" has a concurrency limit of
one, scheduled and event Lambdas, if concurrent, will be "time shifted"
by a random value.
This way they will not execute attachSpotInstance at the same time.

In the case of a scheduled Lambda we prepend the "S" char to the instanceId
used for the randSeed, to avoid it resuming a process suspended by an event
Lambda.

- Fix in swapWithGroupMember:
defer asg.suspendResumeProcess for the resume, so that it will be executed
even if the swapWithGroupMember function returns an error.

* * gofmt cosmetic changes

* Fix to the rand.Intn parameter to also include 9

Co-authored-by: Chris Farrenden <cfarrend@gmail.com>
Co-authored-by: Rubi <14269809+codenoid@users.noreply.github.com>
Co-authored-by: Jawad <jawad.stouli@gmail.com>
Co-authored-by: Gábor Lipták <gliptak@gmail.com>

* Feat/event based instance replacement - Fix to terminateRandomSpotInstance logic in Schedule run + specify the on-demand price multiplier at the ASG group level (#425)

* fix terminateRandomSpotInstanceIfHavingEnough

need another condition to avoid terminating a valid spot instance in case the
ASG has minOnDemand > 0

if all ASG instances are in the running state and
  the minimum on-demand count equals the total on-demand instances running and
  the number of running instances equals the desired capacity,
it means that I do not need to terminate a spot instance

needs some testing
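A minimal sketch of this extra guard; the counts are assumed to come from the ASG scan:

```go
package asg

// shouldSkipSpotTermination is a sketch of the condition above: when every
// instance is running, the on-demand count matches the configured minimum,
// and the total running count equals the desired capacity, the spot capacity
// is already correct and nothing should be terminated.
func shouldSkipSpotTermination(allRunning bool, runningOnDemand, minOnDemand,
	totalRunning, desiredCapacity int64) bool {
	return allRunning &&
		runningOnDemand == minOnDemand &&
		totalRunning == desiredCapacity
}
```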

* fix terminateRandomSpotInstanceIfHavingEnough

fixes

* fix terminateRandomSpotInstanceIfHavingEnough

changed allInstancesRunning to also return the running on-demand instances

* Specify/override the multiplier for the on-demand price at a group
level

As i.price has already been multiplied by the global value, if
specified, I first need to divide it by that same value and then multiply
it by the multiplier specific to the ASG.

We need to do this for both scheduled and event actions.
For events we act in the function belongsToEnabledASG.
For schedules we act in launchSpotReplacement.

Needs deep testing...
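A one-liner sketch of this divide-then-multiply adjustment (assuming a non-zero global multiplier and that the price already includes it):

```go
package asg

// applyGroupMultiplier undoes the global on-demand price multiplier already
// baked into price, then applies the per-ASG override.
// Assumes globalMultiplier > 0.
func applyGroupMultiplier(price, globalMultiplier, groupMultiplier float64) float64 {
	return price / globalMultiplier * groupMultiplier
}
```

So, for instance, a price of 0.10 computed with a global multiplier of 1.0 and a group-level override of 0.7 becomes 0.07.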

* Specify/override the multiplier for the on-demand price at a group
    level - fix

in loadConfOnDemandPriceMultiplier we need to use
a.config.OnDemandPriceMultiplier in place of
a.region.conf.OnDemandPriceMultiplier

this way a.region.conf.OnDemandPriceMultiplier will preserve the
original global value

* Misc fixes to be able to run go tests

* Misc fixes to be able to run go tests - continue

* Misc fixes to be able to run go tests - end for now

* For Test_autoScalingGroup_terminateRandomSpotInstanceIfHavingEnough
added condition:
"spot capacity is correct, skip termination"

Co-authored-by: Chris <chris.farrenden@domain.com.au>
Co-authored-by: 0x11 <14269809+codenoid@users.noreply.github.com>
Co-authored-by: Jawad <jawad.stouli@gmail.com>
Co-authored-by: Gábor Lipták <gliptak@gmail.com>

* Feat/event based instance replacement - Merged all "new" commits from master (#433)

* Merge Updating AWS SDK to latest version - [0ee95c8]

* Merge Relicense to OSL-3 - [972bc61]

* Merge Update readme, mention the relicensing to OSL-3 - [9b438dc]

* Merge Fix `make archive` on macOS [c929391]

* Merge Move some logs to debug [fa0c27b]

* Merge Delete old Gopkg files [5504e34]

* Merge Update dependencies [0e14a7b]

* Merge Support spot price buffer percentage of 0 [26ca955]

* Merge Don't segfault when spot instance doesn't belong to ASG [f313c6d]

* Merge Allow specifying GOOS and GOARCH [8d67d6e]

* Merge Use /bin/bash for shell [0eed8cd]

* Merge Ignore terminating spot instances that don't belong to
AutoSpotting (master) and Enable Spot Termination ASG Checking (event) [33a444c]

* Merge Revert Use /bin/bash for shell [6e45440]

* Merge Pass --abort-on-container-exit to docker-compose [8f90fba]

* Merge Rename travisci make targets to ci [2d76a9a]

* Merge Create FUNDING.yml + Update FUNDING.yml [dde6d85,ca81828]

* Merge Use paginated version of DescribeSpotPriceHistory [9a770b3]

* Merge Delete DescribeSecurityGroups mock [3bfd542]

* Merge Move config loading out of main and add tests for it [30a4392]

* Fixes to Merge Move config loading out of main and add tests for it

* Merge Move logs about incompatible instance types to debug [985d675]

* Merge Remove incorrect Makefile conditionals [a01ee26]

* Merge Actually fail the build if gofmt fails [c391d69]

* Merge Add tools to go.mod [8a9a90b]

* Merge No longer enforce the name of the ElasticBeanstalk IAM policy [3313298]

* Merge Added spot premium [ecf31a5]

* Merge Update how bid price is calculated for premium instances [8d13fb0]

* Merge Update dependencies [caf373d]

* Merge Cron timezone [8e04942]

* Merge Spleling [a37aafc]

* Merge Update README.md [0109a0b,fb15aa4,4e03db7]

* Merge Use the larger of min-OD-instances and (min-OD-percent * current)
[25ac31f]

* gofmt changes

* cronEventAction - fix to logic

need to invert the tests for needReplaceOnDemandInstances and
onDemandInstance == nil

currently, if onDemandInstance == nil the method returns and execution stops.
This way the check that terminates a spot instance if their number is higher
than required is never performed.

Assume that the ASG scales down and terminates the on-demand instances that
AutoSpotting is not terminating [autospotting_min_on_demand_number].
On the next runs onDemandInstance will be nil and the spot instances in
excess will not be terminated.
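A hedged sketch of the reordered checks; all names are stand-ins for the real ASG state and methods:

```go
package asg

// cronEventActionSketch mirrors the reordering described above: the
// excess-spot check runs before the nil on-demand check, so surplus spot
// capacity left over after a scale-down still gets trimmed.
func cronEventActionSketch(needReplacement bool, onDemandInstance *string,
	terminateExcessSpot, replaceWithSpot func()) {
	if !needReplacement {
		// Previously an early "onDemandInstance == nil" return skipped
		// this check entirely.
		terminateExcessSpot()
		return
	}
	if onDemandInstance == nil {
		return // nothing left to replace
	}
	replaceWithSpot()
}
```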


* Feat/event based instance replacement (#434)

* terminateUnneededSpotInstance - no need to call
asg.terminateRandomSpotInstanceIfHavingEnough

There is no need to call asg.terminateRandomSpotInstanceIfHavingEnough
in terminateUnneededSpotInstance.
We already execute spotInstance.terminate on the unattached spot
instance.

* terminateInstanceInAutoScalingGroup is used for spot too

As terminateInstanceInAutoScalingGroup is also used for terminating spot
instances, changed the comment and logging message.

* Avoid adding instances in the Terminating lifecycle state

A problem arises if the ASG has a lifecycle hook on terminating instances and
the lifecycle action takes more time than the AutoSpotting schedule.

When terminating instances, the lifecycle hook (LH) puts the instance in the
lifecycle state Terminating:Wait (and later Terminating:Proceed).
But during that time, until the LH completes, the instance status is still
running.

So AutoSpotting, when counting running instances, will consider that
instance too.

This causes multiple problems:
* a spot instance can be terminated even if still needed
* an instance can be terminated even if already in a terminating state
* spot instances are launched even if not needed

The fix is very simple:
in the "scanInstances" func, do not execute "a.instances.add" if the instance
lifecycle state begins with Terminating.
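A minimal sketch of this guard, using the aws-sdk-go autoscaling.Instance field names; the helper itself is illustrative:

```go
package asg

import (
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// keepInstance reports whether scanInstances should count an ASG member.
// Terminating, Terminating:Wait and Terminating:Proceed instances are still
// "running" in EC2 terms while a lifecycle hook drains them, but must not
// be counted towards the group's capacity.
func keepInstance(i *autoscaling.Instance) bool {
	return !strings.HasPrefix(aws.StringValue(i.LifecycleState), "Terminating")
}
```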


* terminateUnneededSpotInstance - fix to "total is unused" (#435)

* Add missing import for ioutil

* Fix build and tests

* Fix linter errors reported by `make ci-docker`

* Address a few codeclimate issues

Co-authored-by: Chris <chris.farrenden@domain.com.au>
Co-authored-by: Cristian Magherusan-Stanciu <cmagh@amazon.com>
Co-authored-by: mello7tre <mello+github@ankot.org>
Co-authored-by: Chris Farrenden <cfarrend@gmail.com>
Co-authored-by: Rubi <14269809+codenoid@users.noreply.github.com>
Co-authored-by: Jawad <jawad.stouli@gmail.com>
Co-authored-by: Gábor Lipták <gliptak@gmail.com>

README.md

AutoSpotting


A simple and easy-to-use tool designed to significantly lower your Amazon AWS costs by automating the use of spot instances.

Why?

We believe that AWS EC2 is often pricier than it should be, and that the pricing models that can significantly reduce costs are hard for humans to use reliably and are better handled by automation.

There are many ways to automate the use of spot instances, some offered out of the box by AWS and some from third parties, each with its own characteristics and drawbacks.

Unlike all those options, we developed a novel, simple but effective way to make it much easier to convert existing infrastructure within minutes, with minimal configuration changes, negligible additional infrastructure and runtime costs, safely and securely and without any vendor lock-in.

Note: it's not necessarily the most cost-efficient option for a given AutoScaling group, although it does perform pretty well in practice. The main focus is ease of adoption on large infrastructure, such as environments with hundreds of AutoScaling groups or even hundreds of AWS accounts, where it can be rolled out and enabled in almost no time compared to any other tool out there. Once you have tested it and are confident in it, it can even be enabled against all the groups in an AWS account without touching their configuration.

It also tries to be as cheap as possible to run, with negligible runtime costs; being open source, the software is free of charge if you build it from source. In addition, we offer inexpensive enterprise-grade support plans that should barely be noticeable at scale but are just enough to support further development, unlike similar commercial tools that claim a very significant chunk of your savings.

This approach allows a large number of companies and individuals to significantly reduce their infrastructure costs or get more bang for the same buck. They can now easily get access to cheap compute capacity and spend their scarce resources developing innovative products instead of paying for overpriced compute capacity.

How does it work?

Once installed and enabled by tagging existing on-demand AutoScaling groups, AutoSpotting gradually replaces their on-demand instances with spot instances that are usually much cheaper, at least as large as, and identically configured to, the group's members, without changing the group configuration in any way. For your peace of mind, you can also keep running a configurable number of on-demand instances, given as a percentage or an absolute number.

This can be seen in action below; you can click to expand the animation:

Workflow

It implements some complex logic aware of spot and on-demand prices, including for different spot products and configurable discounts for reserved instances or large-volume customers. It also considers the specs of all instance types and automatically places bids for instance types and at prices chosen based on a flexible configuration, set globally or overridden at the group level using additional tags; these overrides are often not needed.

A single installation can handle all enabled groups in parallel across all available AWS regions, but can be restricted to fewer regions if desired.

Your groups will then monitor and use these spot instances just as they would your on-demand instances. They will automatically join your load balancer and start receiving traffic once they pass the health checks, and traffic is automatically drained on termination.

What savings can I expect?

The savings it generates are often in the 60-80% range, but sometimes even up to 90%, as you can see in the graph below.

Savings

What's under the hood?

The entire logic described above is implemented in a Lambda function deployed using CloudFormation or Terraform stacks that can be installed and configured in just a few minutes.

The stack assigns the function the minimal set of IAM permissions required for it to work and requires no admin-like cross-account permissions. The entire code base can be audited to see how these permissions are being used, and even locked down further if your audit discovers any issues. This is not a SaaS; there's no component that calls home and reveals any details about your infrastructure.

The Lambda function is written in the Go programming language and the code is compiled as a static binary compressed and uploaded to S3. For evaluation or debugging purposes, the same binary can run out of the box locally on Linux machines or as a Docker container on Windows or macOS. Some people even run these containers on their existing Kubernetes clusters assuming the other resources provided by the stack are implemented in another way on Kubernetes.

The stack also includes a cron-like CloudWatch event that runs the Lambda function periodically to take action against the enabled groups. Between runs, your group is entirely managed by AutoScaling (including any scaling policies you may have) and load balancer health checks, which can trigger instance launches or replacements using the original on-demand launch configuration. These instances will later be replaced by better-priced spot instances when they are available on the spot market.

Read here for more information and implementation details.

FAQs

Frequently asked questions about the project are answered in the FAQ; please read it first before asking for support.

If you have additional questions not covered there, they can be easily added to the source of the FAQ by editing in the browser and creating a pull request, and we'll answer them while reviewing the pull request.

Getting Started

Just like in the above animation, it's as easy as launching a CloudFormation (or Terraform) stack and setting the (configurable) spot-enabled tag to true on the AutoScaling groups where you want it enabled.

All the required infrastructure and configuration will be created automatically, so you can get started as fast as possible.
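A group can also be enabled programmatically; this hedged sketch uses the aws-sdk-go AutoScaling API and assumes the default spot-enabled tag name and a hypothetical group name:

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	svc := autoscaling.New(session.Must(session.NewSession()))

	// "my-autoscaling-group" is a placeholder for your group's name.
	_, err := svc.CreateOrUpdateTags(&autoscaling.CreateOrUpdateTagsInput{
		Tags: []*autoscaling.Tag{{
			ResourceId:        aws.String("my-autoscaling-group"),
			ResourceType:      aws.String("auto-scaling-group"),
			Key:               aws.String("spot-enabled"),
			Value:             aws.String("true"),
			PropagateAtLaunch: aws.Bool(false),
		}},
	})
	if err != nil {
		log.Fatal(err)
	}
}
```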

For more detailed information you can read this document.

Launch

Note

  • the binaries launched by this stack are distributed under a proprietary license, and are free to use for evaluation up to $1000 of monthly savings. Once you reach this limit you'll need to either switch to the inexpensive supported binaries (designed to cost a small fraction, around 1%, of your total savings in order to support further development), or build your own binaries based on the open source code and run them for free.

Support

Community support is available in the Gitter chat room, where the main authors and other users are likely to help you solve issues.

Note: This is offered on a best-effort basis and under certain conditions, such as using the latest version of the evaluation binaries.

If you need more comprehensive support you will need to purchase a support plan.

Contributing

Unlike multiple commercial products in this space that cost a lot of money and attempt to lock you in, this project is fully open source and developed in the open by a vibrant community.

We urge you to support us on GitHub Sponsors if this software helps you save any significant amount of money; this will greatly help further development.

Financial sponsorship is voluntary; it's also fine if you just try it out and give feedback, report issues, improve the documentation, write some code, assign a developer to work on it, or even just spread the word among your peers who might be interested in it. Any sort of support would be greatly appreciated and would make a huge difference to the project.

Note: Non-trivial code should be submitted according to the contribution guidelines.

Proprietary binaries

The source code is and will always be open source, so you can build and run it yourself, see how it works and even enhance it if you want.

But if you want to conveniently get started or update within minutes without setting up and maintaining a build environment or any additional infrastructure, we have pre-built evaluation binaries that will save you significant amounts of time and effort.

These can be used for evaluation purposes as long as the generated monthly savings are less than $1000. Once you reach this level you will need to either purchase an inexpensive stable build that doesn't have this limitation and also comes with a support plan, or build AutoSpotting from source code.

The support license costs vary by group, region and AWS account coverage and can also be paid through Patreon.

Individuals and companies supporting the development of the open source code get free-of-charge support and stable-build access for a year after their latest contribution to the project.

Note:

  • even though these evaluation builds are usually stable enough, they may not have been thoroughly tested yet and come with best-effort community support.
  • the Docker images available on DockerHub are also distributed under the same binary license, and the costs are the same.

Stable builds

Carefully tested builds suitable for Enterprise use will be communicated to Patreon backers as soon as they join.

They come with support from the author, who will do his best to help you successfully run AutoSpotting in your environment so you can get the most out of it. Feature requests and issues will also be prioritized based on the Patreon tier.

Please get in touch on Gitter if you have any questions about these stable builds.

Compiling and Installing

It is recommended to use the evaluation or stable binaries, which are easy to install, support further development of the software and allow you to get support.

But if you have special needs that require customizations, or you don't want to rely on the author's infrastructure or contribute anything for longer-term use of the software, you can always build and run customized binaries that you maintain on your own; just keep in mind that those won't be supported in any way.

More details are available here

Users

AutoSpotting is already used by hundreds of individuals and organizations around the world, and we estimate it already saves them millions of dollars monthly. Some of those we know of are mentioned in the list of notable users.

The following deserve a special mention for contributing significantly to the development effort (listed in alphabetical order):

License

This software is distributed under the terms of the OSL-3.0 license.

The official binaries are licensed under this proprietary license.
