
AWS Morning Brief

The latest in AWS news, sprinkled with snark. Posts about AWS come out over sixty times a day. We filter through it all to find the hidden gems, the community contributions--the stuff worth hearing about! Then we summarize it with snark and share it with you--minus the nonsense.
Mon, 19 Jun 2023 19:38:38 -0700
Confused DevOps Professional

Last week in security news: Cloudflare had a Confused Deputy Vulnerability, Moving Away from IAM Identity Center, AWS KMS now supports importing asymmetric and HMAC keys, and more!

Links:

Thu, 15 Jun 2023 03:00:00 -0700
A Hole in the S3 Buckets

Last week in security news: Thinkst Canary's Thinkstscapes, Multiple S3 Bucket Negligence Awards, Credit Card Payment Processing on AWS, and more!

Links:

Thu, 08 Jun 2023 03:00:00 -0700
17 Final Ways to Run Containers on AWS

AWS Morning Brief Extras edition for the week of June 7, 2023.


Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/17-final-ways-to-run-containers/


Never miss an episode


Help the show


Buy our merch


What's Corey up to?


Wed, 07 Jun 2023 07:30:00 -0700
The Wages of TLS

Last week in security news: Faster AWS cloud connections with TLS 1.3, Belkin is crappy in many ways, the Tool of the Week, and more!

Links:

Thu, 01 Jun 2023 03:00:00 -0700
Bad Behavior And Doing Things Right

Last week in security news: The ex-Ubiquiti engineer who stole a giant pile of their data gets a six-year prison term, Bitbucket will be updating their SSH host keys, AWS Reported a GuardDuty Finding Issue, and more!

Links:

Thu, 25 May 2023 03:00:00 -0700
A Hidden Serverless Peril

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/a-hidden-serverless-peril


Never miss an episode


Help the show


Buy our merch


What's Corey up to?

Wed, 24 May 2023 07:30:00 -0700
SCPs Are Not For Me..s?

Last week in security news: Amazon CloudFront announces one-click security protections, SCPkit helps you manage your SCPs, A walk through AWS Verified Access policies, and more!

Links:

Thu, 18 May 2023 03:00:00 -0700
Humoring the Parenthetical

Last week in security news: Containing Compromised EC2 Credentials Without (Hopefully) Breaking Things, How to scan your AWS Lambda functions with Amazon Inspector, AWS IAM Actions, And More!

Links:

Thu, 11 May 2023 03:00:00 -0700
My 9 Favorite Things About AWS

AWS Morning Brief Extras edition for the week of May 10, 2023.


Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/9-things-I-love-about-aws


Never miss an episode


Help the show


Buy our merch


What's Corey up to?

Wed, 10 May 2023 07:30:00 -0700
A Quiet But Bad Week

Last week in security news: Tailscale now offers network flow logs, Google had a GhostToken flaw, AWS reported an issue with IAM supporting multiple MFA devices, and more!

Links:

Thu, 04 May 2023 03:00:00 -0700
Shrieking Like a Toddler

Last week in security news: Dealing with Ransomware in the Cloud, Pen Testing AWS, How to prioritize IAM Access Analyzer findings, and more!

Links:

Thu, 27 Apr 2023 03:00:00 -0700
Why AWS Might Be the Next Backbone Provider

AWS Morning Brief Extras edition for the week of April 26, 2023.

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/why-aws-might-be-the-next-backbone-provider


Never miss an episode


Help the show


Buy our merch


What's Corey up to?

Wed, 26 Apr 2023 07:30:00 -0700
Screwing Up the Messaging and Also the RSA Dates

Last week in security news: Creating an AWS Backup Account, Azure had another cross-tenant access vulnerability, Security Hub Hurts My Self-Esteem, and more!

Links:

Thu, 20 Apr 2023 03:00:00 -0700
"A Quiet Week" He Says, Tempting Fate

Last week in security news: Logging strategies for security incident response, A Department of Energy report shows some rather serious gaps in security monitoring, A dedicated repository of winners of the S3 Bucket Negligence Awards, and more!

Links:

Thu, 13 Apr 2023 03:00:00 -0700
LocalStack: Why Local Development for Cloud Workloads Makes Sense

AWS Morning Brief Extras edition for the week of April 12, 2023.


Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/localstack-why-local-development-for-cloud-workloads-makes-sense


Never miss an episode


Help the show


Buy our merch


What's Corey up to?

Wed, 12 Apr 2023 07:30:00 -0700
A Repository of AWS Customer Breaches

Last week in security news: Gain insights and knowledge at AWS re:Inforce 2023, InvalidClientTokenId, a repository of AWS customer breaches, and more!

Links:

Thu, 06 Apr 2023 03:00:00 -0700
GitHub's Bad Key Week

Last week in security news: GitHub accidentally published its RSA SSH host key, Automate IAM credential reports for large AWS Organizations, The Tool of the Week, and more!

Links:

Thu, 30 Mar 2023 03:00:00 -0700
S3 as an Eternal Service

AWS Morning Brief Extras edition for the week of March 29, 2023.


Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/s3-as-an-eternal-service

Never miss an episode


Help the show


Buy our merch


What's Corey up to?

Wed, 29 Mar 2023 07:30:00 -0700
Y'allbikey Configuration Guide

Last week in security news: The Many Ways to Access DynamoDB, a Yubikey configuration cheatsheet, and more!

Links:

Thu, 23 Mar 2023 03:00:00 -0700
AWS's Anti-Competitive Move Hidden in Plain Sight

AWS Morning Brief Extras edition for the week of March 15, 2023.

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/awss-anti-competitive-move-hidden-in-plain-sight/


Never miss an episode


Help the show


Buy our merch


What's Corey up to?

Wed, 15 Mar 2023 07:30:00 -0700
LastPass, LastHope, LostPass, LostHope

Last week in security news: Audit Log Wall of Shame, More info on the LastPass breach, the Tool of the Week, and more!

Links:

Thu, 09 Mar 2023 03:00:00 -0800
Corey Invades Seattle

Last week in security news: US Military emails leaked on an exposed server, How to monitor and query IAM resources at scale, the Tool of the Week, and more!

Links:

Thu, 02 Mar 2023 03:00:00 -0800
AWS is Asleep at the Lambda Wheel

AWS Morning Brief Extras edition for the week of March 1, 2023.

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/aws-is-asleep-at-the-lambda-wheel


Never miss an episode


Help the show


Buy our merch


What's Corey up to?


Wed, 01 Mar 2023 07:30:00 -0800
A Little Security for Everyone

Last week in security news: More security woes for Azure, the AWS Survival Kit, CloudGPT, and more!

Links:

Thu, 23 Feb 2023 03:00:00 -0800
Amazon's Snowball Edge Frustrates This User

AWS Morning Brief Extras edition for the week of February 22, 2023.

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/amazons-snowball-edge-frustrates-this-user

Never miss an episode


Help the show


Buy our merch


What's Corey up to?

Wed, 22 Feb 2023 07:30:00 -0800
Technical Debt Cash-Out Refinance
Tue, 21 Feb 2023 03:00:00 -0800
Attacked S3s and Guilty Pleas

Last week in security news: Ubiquiti inside attacker pleads guilty, Wiz 2023 State of the Cloud report, the tool of the week, and more!

Links:

Thu, 16 Feb 2023 03:00:00 -0800
The Dumbest Dollars a Cloud Provider Can Make (Replay)

AWS Morning Brief Extras edition for the week of February 15, 2023.


Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/the-dumbest-dollars-a-cloud-provider-can-make/


Never miss an episode


Help the show


Buy our merch


What's Corey up to?


Wed, 15 Feb 2023 07:30:00 -0800
Wait did you say "Drone Manufacturer?!"

Links:

Thu, 09 Feb 2023 03:00:00 -0800
The AWS Community Isn't for Amazonians

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/the-aws-community-isnt-for-amazonians


Never miss an episode


Help the show


Buy our merch


What's Corey up to?

Wed, 08 Feb 2023 07:30:00 -0800
Azure Improves Slowly

Links:

Thu, 02 Feb 2023 03:00:00 -0800
S3 Encryption at Rest Does NOT Solve for Bucket Negligence

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/s3-encryption-at-rest-does-not-solve-for-bucket-negligence/


Never miss an episode


Help the show


Buy our merch


What's Corey up to?

Wed, 01 Feb 2023 07:30:00 -0800
Wait Did You Say Root API Keys?

Links:

Thu, 19 Jan 2023 03:00:00 -0800
Computers Checking Compliance Boxes

This episode is sponsored in part by the Google for Startups Cloud Program

Links:

Thu, 12 Jan 2023 03:00:00 -0800
Holiday Replay: Why I Turned Down an AWS Job Offer

This episode originally aired on October 13, 2021

Check out a related YouTube Video here: https://youtu.be/BCiUulzr9f8


Never miss an episode


Help the show


Buy our merch


What's Corey up to?

Wed, 28 Dec 2022 07:30:00 -0800
A Bunch of Vulnerabilities is Called an Embarrassment

Links:

Thu, 22 Dec 2022 03:00:00 -0800
Holiday Replay: The Right and Wrong Way to Interview Engineers

This episode originally aired on July 17, 2020.

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/the_right_and_wrong_way_to_interview_engineers/

Never miss an episode


Help the show


Buy our merch


What's Corey up to?

Wed, 21 Dec 2022 07:30:00 -0800
A Multi-Cloud Rant (Holiday Replay)

This episode was originally released on August 20, 2021.

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/a_multicloud_rant/


Want to watch a rant about Multi-Cloud? Watch our “Multi-Cloud is a Terrible Idea” YouTube video here: https://youtu.be/Mlr7vioQqwg


Never miss an episode


Help the show


Buy our merch


What's Corey up to?

Wed, 14 Dec 2022 07:30:00 -0800
The Unfulfilled Promise of Serverless

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/The-Unfulfilled-Promise-of-Serverless/


This episode was originally released on November 3, 2021.

Never miss an episode


Help the show


Buy our merch


What's Corey up to?

Wed, 07 Dec 2022 07:30:00 -0800
The Releases are Coming Fast and Furious Now

Links:


Stay Up To Date with re:Quinnvent


Help the show


What's Corey up to?

Wed, 30 Nov 2022 07:30:00 -0800
The Releases of re:Invent are in Full Swing
Tue, 29 Nov 2022 07:30:00 -0800
The Feudal Lords of Amazon: AWS' Infinite Service Launches and Counterproductive Culture

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/the-feudal-lords-of-amazon/


Want to watch the full dramatic reenactment of this podcast? Watch the YouTube Video here: https://youtu.be/g1guW6tiR50


Never miss an episode


Help the show


Buy our merch


What's Corey up to?


Wed, 23 Nov 2022 07:30:00 -0800
The Canary in the Git Mine
Thu, 17 Nov 2022 03:00:00 -0800
How To Learn Something New: Kubernetes The Much Harder Way

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/How-To-Learn-Something-New-Kubernetes-the-Much-Harder-Way

Want to watch the full dramatic reenactment of this podcast? Watch the YouTube Video here: https://youtu.be/bpp5tpgU6CE


Never miss an episode


Help the show


Buy our merch


What's Corey up to?

Wed, 16 Nov 2022 07:30:00 -0800
gp3 for thee, RDS
Mon, 14 Nov 2022 03:00:00 -0800
An alterNAT Future: We Now Have a NAT Gateway Replacement

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/an-alternat-future-we-now-have-a-nat-gateway-replacement/


Never miss an episode


Help the show


Buy our merch


What's Corey up to?

Wed, 09 Nov 2022 07:30:00 -0800
Azure Makes it Worse

Links:

Thu, 03 Nov 2022 03:00:00 -0700
AWS re:Invent: What You Actually Need To Know Before You Go

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/aws-re-invent-what-you-actually-need-to-know-before-you-go/

Want to watch the full dramatic reenactment of this podcast? Watch the YouTube Video here: https://youtu.be/lZPDfTXmfI4


Never miss an episode


Help the show


Buy our merch


What's Corey up to?

Wed, 02 Nov 2022 07:30:00 -0700
The pre:Invent Drumbeat Starts
Mon, 31 Oct 2022 03:00:00 -0700
The Real Reason Cloud IDE Adoption Is Lagging

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/the-real-reason-cloud-ide-adoption-is-lagging


Want to watch the full dramatic reenactment of this podcast? Watch the YouTube Video here:

https://youtu.be/fRc0maN0Z_I


Never miss an episode


Help the show


Buy our merch


What's Corey up to?

Wed, 26 Oct 2022 07:30:00 -0700
Azure: Less a Cloud Than Performance Art

Links:

Thu, 20 Oct 2022 03:00:00 -0700
A Brief History of Kubernetes, Its Use Cases, and Its Problems

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/a-brief-history-of-kubernetes-its-use-cases-and-its-problems

Want to watch the full dramatic reenactment of this podcast? Watch the YouTube Video here: https://youtu.be/StlZwvsq9tc


Never miss an episode


Help the show


Buy our merch


What's Corey up to?

Wed, 19 Oct 2022 07:30:00 -0700
AWS Data Transfer Charges: Ingress Actually Is Free

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/aws-data-transfer-charges-ingress-actually-is-free/


Never miss an episode


Help the show


Buy our merch


What's Corey up to?

Wed, 12 Oct 2022 07:30:00 -0700
Confidential Computing Is a Cloud Paranoia-Based Wasteland

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/confidential-computing-is-for-the-tinfoil-hat-brigade

Want to watch the full dramatic reenactment of this podcast? Watch the YouTube Video here: https://youtu.be/z_jD64jGhhI


Never miss an episode


Help the show


Buy our merch


What's Corey up to?

Wed, 05 Oct 2022 07:30:00 -0700
Inadvertent Compliance Week

Links:

Thu, 29 Sep 2022 03:00:00 -0700
The Baffling Maze of Kubernetes

Want to give your ears a break and read this as an article? You’re looking for this link.

Want to watch the full dramatic reenactment of this podcast? Watch the YouTube Video here: https://youtu.be/iOqSjqhD2lc


Never miss an episode


Help the show


Buy our merch


What's Corey up to?

Wed, 28 Sep 2022 07:30:00 -0700
Getting Twitchy About the AWS Bill
AWS Morning Brief for the week of Monday, September 26th with Corey Quinn.
Mon, 26 Sep 2022 03:00:00 -0700
Connecting All William-Nilliam

Links:

Thu, 22 Sep 2022 03:00:00 -0700
The Next AWS CMO: Corey Quinn

Want to give your ears a break and read this as an article? You’re looking for this link.

Want to watch the full dramatic reenactment of this podcast? Watch the YouTube Video here: https://youtu.be/2ve_Xmtx7_o


Never miss an episode


Help the show


Buy our merch


What's Corey up to?

Wed, 21 Sep 2022 07:30:00 -0700
The Swole Architected Framework
AWS Morning Brief for the week of September 19th, 2022 with Corey Quinn.
Mon, 19 Sep 2022 03:00:00 -0700
Naming Things Accurately

Links:

  • Nick Frichette wrote an incredibly handy guide on the ordered steps to take to avoid CloudFront or DNS domain takeovers on AWS.
  • This handy walkthrough talks about how to configure something that shrieks its head off whenever someone logs into AWS via the root account; a minimal sketch of that pattern follows this list.
  • The Center for Internet Security just released an update to the AWS version of their security benchmarks, and this approachable post goes through what's new.
  • Introducing message data protection for Amazon SNS - This is a bit hard to wrap my head around--then Scott Piper nailed it with "it's Macie for SNS and now I'm wondering what the point of me even is."
  • I've talked about Parliament before--it's an AWS IAM linting library. Version 1.6.0 just dropped.
  • I'll be in the DC area next week; come by Highline at 7PM and let me buy you a drink / swap stories if you're around.
Thu, 15 Sep 2022 03:00:00 -0700
Google Cloud Functions Is Surprisingly Delightful

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/google-cloud-functions-is-surprisingly-delightful

Want to watch the full dramatic reenactment of this podcast? Watch the YouTube Video here: https://youtu.be/lV-Q0EO63fo


Never miss an episode


Help the show


Buy our merch


What's Corey up to?

Wed, 14 Sep 2022 07:30:00 -0700
AWS Deft Punk
AWS Morning Brief for the week of September 12, 2022 with Corey Quinn.
Mon, 12 Sep 2022 03:00:00 -0700
Mobile Authentication to AWS is Hard

Links:

Thu, 08 Sep 2022 03:00:00 -0700
The Harrowing Search for the Elusive Technical Answer

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/the-harrowing-search-for-the-elusive-technical-answer

Want to watch the full dramatic reenactment of this podcast? Watch the YouTube Video here: https://youtu.be/mZDquxNO09s\\

Never miss an episode


Help the show


Buy our merch


What's Corey up to?

Wed, 07 Sep 2022 07:30:00 -0700
26.5 AWS Regions
AWS Morning Brief for the week of September 5, 2022 with Corey Quinn.
Tue, 06 Sep 2022 03:00:00 -0700
The Spiritual Alignment of Cloud Economics

Links:

Thu, 01 Sep 2022 03:00:00 -0700
How Google Cloud and AWS Approach Customer Carbon Emissions

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/how-google-cloud-and-aws-approach-customer-carbon-emissions

Want to watch the full dramatic reenactment of this podcast? Watch the YouTube Video here: https://youtu.be/eyO1DqP9LhY


Never miss an episode


Help the show


Buy our merch


What's Corey up to?


Wed, 31 Aug 2022 07:30:00 -0700
The Root Beer Conference
AWS Morning Brief for the week of August 29, 2022 with Corey Quinn.
Mon, 29 Aug 2022 03:00:00 -0700
Rumors All Atwitter
Thu, 25 Aug 2022 03:00:00 -0700
Amazon SageMaker is Responsible for My Surprise Bill

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/sagemaker_is_responsible_for_my_surprise_bill/


Want to watch the full dramatic reenactment of this podcast? Watch the YouTube Video here: https://youtu.be/LCZjSZhRAjs


Never miss an episode


Help the show


Buy our merch


What's Corey up to?

Wed, 24 Aug 2022 07:30:00 -0700
Low Tech Earthquake Detection
AWS Morning Brief for the week of August 22, 2022 with Corey Quinn.
Mon, 22 Aug 2022 03:00:00 -0700
Trivy-al Releases

Links:

Thu, 18 Aug 2022 03:00:00 -0700
An Unexpected Love Letter to Azure

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/an_unexpected_love_letter_to_azure/

Want to watch the full dramatic reenactment of this podcast? Watch the YouTube Video here: https://youtu.be/NIsF_NS1B0k

Never miss an episode

Help the show

Buy our merch

What's Corey up to?

Wed, 17 Aug 2022 07:30:00 -0700
AWS Private 5G v2
AWS Morning Brief for the week of August 15, 2022 with Corey Quinn.
Mon, 15 Aug 2022 03:00:00 -0700
Twilio's Insecure Text Message Issue

Links:

Thu, 11 Aug 2022 03:00:00 -0700
Cadence Is Culture: Why Amazonians Need to Overload Us at re:Invent

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/why_amazon_cant_end_the_release_tidal_wave/

Want to watch the full dramatic reenactment of this podcast? Watch the YouTube Video here: https://youtu.be/eKMxBNF5N-k


Never miss an episode


Help the show


What's Corey up to?

Wed, 10 Aug 2022 07:30:00 -0700
Very Tired Lambda Pricing
AWS Morning Brief for the week of August 8, 2022 with Corey Quinn.
Mon, 08 Aug 2022 03:00:00 -0700
Single Sign On, Multiple Names

Links:

Thu, 04 Aug 2022 03:00:00 -0700
Are AWS account IDs sensitive information?

Want to give your ears a break and read this as an article? You’re looking for this link.


Never miss an episode


Help the show


What's Corey up to?

Wed, 03 Aug 2022 07:30:00 -0700
Crappy Clone of a Fast Database
AWS Morning Brief for the week of August 1, 2022 with Corey Quinn.
Mon, 01 Aug 2022 03:00:00 -0700
Never Gonna Shut Me Up

Want to watch the full dramatic reenactment of this podcast? Watch the YouTube Video here: https://youtu.be/Q2Zpg5jQe-Q

Never miss an episode


Help the show


What's Corey up to?

Thu, 28 Jul 2022 07:30:00 -0700
The Mental Breakdown of Auto-Remediation
Wed, 27 Jul 2022 07:30:00 -0700
New Cloudscape Cloudscrapes
AWS Morning Brief for the week of July 25, 2022 with Corey Quinn.
Mon, 25 Jul 2022 03:00:00 -0700
Azure's Security Vulnerabilities are Out of Control

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/azures_vulnerabilities_are_quack

Want to watch the full dramatic reenactment of this podcast? Watch the YouTube Video here: https://youtu.be/5iTxtBnCPys


Never miss an episode


Help the show


What's Corey up to?

Wed, 20 Jul 2022 07:30:00 -0700
Immortal AWS Accounts, the Methuselah Pattern
AWS Morning Brief for the week of July 18th, 2022 with Corey Quinn.
Mon, 18 Jul 2022 03:00:00 -0700
AWS Bakery: Rolls Everywhere

Links:

Thu, 14 Jul 2022 03:00:00 -0700
My Security Posture

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/coreys-security-posture-2022


Want to watch the full dramatic reenactment of this podcast? Watch the YouTube Video here: https://youtu.be/dHDY69hIvvk


Never miss an episode


Help the show


What's Corey up to?

Wed, 13 Jul 2022 07:30:00 -0700
How I Spent My Summer Vacation and College Tuition
AWS Morning Brief for the week of July 11, 2022 with Corey Quinn.
Mon, 11 Jul 2022 03:00:00 -0700
The ChatOps Issue That No One's Chatting About

Want to give your ears a break and read this as an article? You’re looking for this link:

https://www.lastweekinaws.com/blog/the-chatops-issue-no-ones-chatting-about


Want to watch the full dramatic reenactment of this podcast? Watch the YouTube Video here: https://youtu.be/eBKZ71OLjG8


Never miss an episode


Help the show


What's Corey up to?

Wed, 06 Jul 2022 07:30:00 -0700
Mr. Selipsky's Geography Class
AWS Morning Brief for the week of July 4th, 2022 with Corey Quinn.
Tue, 05 Jul 2022 03:00:00 -0700
Enter Your Passwordle

Links:

Thu, 30 Jun 2022 03:00:00 -0700
9 Ways AWS Made Me Headdesk When Using The CDK

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/9-ways-aws-cdk-headdesk


Want to watch the full dramatic reenactment of this podcast? Watch the YouTube Video here: https://youtu.be/3Mf3_l6iEtA

Never miss an episode


Help the show


What's Corey up to?

Wed, 29 Jun 2022 07:30:00 -0700
Concerning Your DeepRacer's Extended Warranty
AWS Morning Brief for the week of June 27, 2022 with Corey Quinn.
Mon, 27 Jun 2022 03:00:00 -0700
Bugcrowd Bugs the Crowd

Links:

Thu, 23 Jun 2022 03:00:00 -0700
Should I Take a Job at AWS?

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/should-you-take-a-job-at-aws/

Want to watch the full dramatic reenactment of this podcast? Watch the YouTube Video here: https://youtu.be/BCiUulzr9f8



Never miss an episode



Help the show



What's Corey up to?

Wed, 22 Jun 2022 07:30:00 -0700
Add a Mantium
AWS Morning Brief for the week of June 20, 2022 with Corey Quinn.
Tue, 21 Jun 2022 03:00:00 -0700
re:Invent Keynote 2026: Analysis

Want to give your ears a break and read this as an article? You’re looking for this link:

https://www.lastweekinaws.com/blog/reinvent-keynote-incident/


Want to watch the full dramatic reenactment of this podcast? Watch the YouTube Video here: https://youtu.be/NGvLMsf4Wg8

Never miss an episode



Help the show


Wed, 15 Jun 2022 03:00:00 -0700
Cars 4, featuring “Pixar Tractor on AWS”
AWS Morning Brief for the week of June 13, 2022 with Corey Quinn.
Mon, 13 Jun 2022 03:00:00 -0700
The Strange, Too Familiar Tale of Uncle Suitcase

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/the-strange-too-familiar-tale-of-uncle-suitcase/


Want to watch the full dramatic reenactment of this podcast? Watch the YouTube Video here: https://youtu.be/x70EypnAH1Y



Never miss an episode



Help the show



What's Corey up to?


Wed, 08 Jun 2022 03:00:00 -0700
Googling the AWS CDK V1
AWS Morning Brief for the week of June 6, 2022, with Corey Quinn.
Mon, 06 Jun 2022 03:00:00 -0700
The Aurora Serverless Road Not Taken

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/the-aurora-serverless-road-not-taken/



Never miss an episode



Help the show



What's Corey up to?

Wed, 01 Jun 2022 03:00:00 -0700
Amazon Basics NXP Chips from Annapurna Labs
AWS Morning Brief for the week of May 30, 2022 with Corey Quinn.
Mon, 30 May 2022 03:00:00 -0700
An AWS Free Tier Bill Shock: Your Next Steps

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/an-aws-free-tier-bill-shock-your-next-steps



Never miss an episode



Help the show



What's Corey up to?

Wed, 25 May 2022 07:30:00 -0700
Amazon's Original Risk Store
AWS Morning Brief for the week of May 23, 2022 with Corey Quinn.
Mon, 23 May 2022 03:00:00 -0700
F5 Exploit the Exact Opposite of Refreshing
Thu, 19 May 2022 03:00:00 -0700
Fixing the AWS Free Tier is No Longer Optional

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/an-aws-free-tier-bill-shock-your-next-steps/



Never miss an episode



Help the show



What's Corey up to?

Wed, 18 May 2022 07:30:00 -0700
Amazon Data Fencing
AWS Morning Brief for the week of May 16, 2022 with Corey Quinn.
Mon, 16 May 2022 03:00:00 -0700
Suddenly Nobody Wants to Build Heroku

Links:

Thu, 12 May 2022 03:00:00 -0700
AWS's Deprecation Policy Is Like a Platypus

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/aws-s-deprecation-policy-is-like-a-platypus



Never miss an episode



Help the show



What's Corey up to?

Wed, 11 May 2022 07:30:00 -0700
AWS WindWanker
AWS Morning Brief for the week of May 9, 2022 with Corey Quinn.
Mon, 09 May 2022 03:00:00 -0700
How to Win in Cloud

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/how-to-win-in-cloud



Never miss an episode



Help the show



What's Corey up to?

Wed, 04 May 2022 07:30:00 -0700
Amazon CloudWatch for Sharon
AWS Morning Brief for the week of May 2, 2022 with Corey Quinn.
Mon, 02 May 2022 08:23:39 -0700
AWS Starts the Security Communication Improvement Slog
Thu, 28 Apr 2022 03:00:00 -0700
AWS's Open Source Problem

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/aws-s-open-source-problem




Never miss an episode



Help the show



What's Corey up to?

Wed, 27 Apr 2022 07:30:00 -0700
AWS GoForIt (With Expedia Group Compatibility)
AWS Morning Brief for the week of April 25, 2022 with Corey Quinn.
Mon, 25 Apr 2022 03:00:00 -0700
Shitposting as a Learning Style

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/shitposting-as-a-learning-style



Never miss an episode



Help the show



What's Corey up to?

Wed, 20 Apr 2022 07:30:00 -0700
Amazon's Competitive Advantage
AWS Morning Brief for the week of April 18, 2022 with Corey Quinn.
Mon, 18 Apr 2022 03:00:00 -0700
Denonia Denials
Thu, 14 Apr 2022 03:00:00 -0700
Taking AWS Account Logins For Granted

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/taking-aws-account-logins-for-granted



Never miss an episode



Help the show



What's Corey up to?

Wed, 13 Apr 2022 07:30:00 -0700
Requiem for a Weasel
AWS Morning Brief for the week of April 11, 2022 with Corey Quinn.
Mon, 11 Apr 2022 03:00:00 -0700
Okta and Ubiquiti Duel For Negative Attention

Links Referenced:

Transcript

Corey: This is the AWS Morning Brief: Security Edition. AWS is fond of saying security is job zero. That means it’s nobody in particular’s job, which means it falls to the rest of us. Just the news you need to know, none of the fluff.

Corey: Today’s episode is brought to you in part by our friends at MinIO, the high-performance Kubernetes-native object store that’s built for the multi-cloud, creating a consistent data storage layer for your public cloud instances, your private cloud instances, and even your edge instances, depending upon what the heck you’re defining those as, which depends probably on where you work. Getting that unified is one of the greatest challenges facing developers and architects today. It requires S3 compatibility, enterprise-grade security and resiliency, the speed to run any workload, and the footprint to run anywhere, and that’s exactly what MinIO offers. With superb read speeds in excess of 360 gigs and a 100-megabyte binary that doesn’t eat all the data you’ve got on the system, it’s exactly what you’ve been looking for. Check it out today at min.io/download, and see for yourself. That’s min.io/download, and be sure to tell them that I sent you.

Corey: A somehow quiet week as we all grapple with the recent string of security failures from, well, take your pick really.

A bit late but better than never, Okta’s CEO admits the LAPSUS$ hack has damaged trust in the company. The video interview is surprisingly good in parts, but he ruins the, “Third-party this, third-party that, no—it was our responsibility, and our failure” statement by then saying that they no longer do business with Sitel—the third-party who was responsible for part of this breach. Crisis comms is really something to figure out in advance of a crisis, so you don’t get in your own way.

Paul Vixie, creator of a few odds and ends such as DNS, has taken a job as a Distinguished Engineer VP at AWS and I look forward to misusing more of his work as databases. He’s apparently in the security org, which is why I’m talking about him today and not Monday.

And of course, as I’ve been ranting about in yesterday’s newsletter and on Twitter, Ubiquiti has sued Brian Krebs for defamation. Frankly they come off as far, far worse for this than they did at the start. My position has shifted from one of sympathy to, “Well, time to figure out who sells a 10Gbps switch that isn’t them.”

Corey: This episode is sponsored in part by LaunchDarkly. Take a look at what it takes to get your code into production. I’m going to just guess that it’s awful because it’s always awful. No one loves their deployment process. What if launching new features didn’t require you to do a full-on code and possibly infrastructure deploy? What if you could test on a small subset of users and then roll it back immediately if results aren’t what you expect? LaunchDarkly does exactly this. To learn more, visit launchdarkly.com and tell them Corey sent you, and watch for the wince.

AWS had an interesting post: “Best practices: Securing your Amazon Location Service resources”. AWS makes a good point here. It hadn’t occurred to me that you’d need to treat location data particularly specially, but of course you do. The entire premise of the internet falls apart if it suddenly gets easier to punch someone in the face for something they said on Twitter.



And two tools of note this week for you. Access Undenied parses AWS AccessDenied CloudTrail events, explains the reasons for them, and offers actionable fixes. And aws-keys-sectool does something obvious in hindsight: making sure that any long-lived credentials on your machine are access-restricted to your own IP address. Check it out.
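The first tool's core trick is easy to approximate by hand if you just want a taste: CloudTrail's lookup_events API can't filter on errorCode server-side, so you pull recent events and filter client-side. A rough boto3 sketch (Access Undenied itself goes much further and explains the denying policy):

    import json

    import boto3

    ct = boto3.client("cloudtrail")

    # Scan recent management events and keep the access-denied ones.
    paginator = ct.get_paginator("lookup_events")
    for page in paginator.paginate(PaginationConfig={"MaxItems": 500}):
        for event in page["Events"]:
            detail = json.loads(event["CloudTrailEvent"])
            if detail.get("errorCode") in ("AccessDenied", "Client.UnauthorizedOperation"):
                print(detail["eventName"], detail["userIdentity"].get("arn"))

And that’s what happened last week in AWS security. Continue to make good choices because it seems very few others are these days.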

Corey: Thank you for listening to the AWS Morning Brief: Security Edition with the latest in AWS security that actually matters. Please follow AWS Morning Brief on Apple Podcast, Spotify, Overcast—or wherever the hell it is you find the dulcet tones of my voice—and be sure to sign up for the Last Week in AWS newsletter at lastweekinaws.com.

Announcer: This has been a HumblePod production. Stay humble.

Thu, 07 Apr 2022 03:00:00 -0700
Ubiquiti Teaches AWS Security and Crisis Comms Via Counterexample

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/ubiquiti-teaches-aws-security-and-crisis-comms-via-counterexample



Never miss an episode



Help the show



What's Corey up to?

Wed, 06 Apr 2022 07:30:00 -0700
I Am Not Responsible For the Content or Accuracy of This Podcast
AWS Morning Brief for the week of April 4, 2022 with Corey Quinn.
Mon, 04 Apr 2022 03:00:00 -0700
The Perils of Bad Corporate Comms

Links:


Transcript

Corey: This is the AWS Morning Brief: Security Edition. AWS is fond of saying security is job zero. That means it’s nobody in particular’s job, which means it falls to the rest of us. Just the news you need to know, none of the fluff.

Corey: Today’s episode is brought to you in part by our friends at MinIO, the high-performance Kubernetes-native object store that’s built for the multi-cloud, creating a consistent data storage layer for your public cloud instances, your private cloud instances, and even your edge instances, depending upon what the heck you’re defining those as, which depends probably on where you work. Getting that unified is one of the greatest challenges facing developers and architects today. It requires S3 compatibility, enterprise-grade security and resiliency, the speed to run any workload, and the footprint to run anywhere, and that’s exactly what MinIO offers. With superb read speeds in excess of 360 gigs and a 100-megabyte binary that doesn’t eat all the data you’ve got on the system, it’s exactly what you’ve been looking for. Check it out today at min.io/download, and see for yourself. That’s min.io/download, and be sure to tell them that I sent you.

Corey: The Okta breach continues to reverberate. As of this recording, the real damage remains the lack of clear, concise, and upfront communication about this. It’s become very clear that had the Lapsus$ folks not gone public about the breach, Okta certainly never would have either.



Now, from the community. Let’s see what they had to say. Cloudflare has posted the results of their investigation of the January 2022 Okta compromise to their blog, and I have a few things I want to say about it.



First, I love that they do this. I would be a bit annoyed at them taking digs at other companies except for the part where they’re at least as rigorous in investigations that they post about their own security and uptime challenges. Secondly, they’ve been levelheaded and remarkably clear in their communication around the issue which only really affects them as an Okta customer. Okta themselves have issued a baffling series of contradicting claims. Regardless of the truth of what happened from a security point of view, the lack of ability to quickly and clearly articulate the situation means that Okta is now under a microscope for folks who care about security—which basically rounds to every last one of their customers.



Now, I generally don’t talk too much about tweets because otherwise this just becomes Twitter revisited, but Scott Piper had an issue trying to keep his flaws.cloud thing open: he got an account-closure notice from AWS. And a phrase he used that I loved was, “You know it’s a legit AWS email because the instructions are very bad.”



I really can’t stress enough that while clear communication is always a virtue, circumstances involving InfoSec, fraud, account closures, and similar should all be ones in which particular care is taken to exactly what you say and how you say it.

An NPM package maintainer sabotaged their own package to protest the war in Ukraine, which is a less legitimate form of protest than many others. There’s never been a better time to make sure you’re pinning dependencies in your various projects.

It’s always worth reading an article titled “AWS IAM Demystified” because it’s mystifying unless you’re one of a very small number of people. I learned new things myself by doing that and you probably will too.



And oof. A while back Cognito User Groups apparently didn’t have delimiter detection working quite right. As a result, you could potentially get access to groups you weren’t supposed to be part of. While AWS did update some of their documentation and fix the problem, it’s a security issue without provable customer impact, so of course, we’re learning about it from a third-party: Opsmorph in this case. Good find.
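For readers who haven't hit this class of bug before, here's a generic, purely illustrative sketch of delimiter injection (not the specific Cognito mechanics from Opsmorph's write-up): when group names get flattened into a delimited string and naively split back apart, a name containing the delimiter mints memberships out of thin air.

    # Purely illustrative; not the actual Cognito implementation.
    def parse_groups(claim: str) -> list:
        # Naive: splits on "," without checking that names can't contain it.
        return claim.split(",")

    groups = ["readers", "eng,admins"]   # "eng,admins" is ONE group name
    claim = ",".join(groups)             # flattened without escaping
    print(parse_groups(claim))           # ['readers', 'eng', 'admins'] -- oops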

Corey: This episode is sponsored in part by LaunchDarkly. Take a look at what it takes to get your code into production. I’m going to just guess that it’s awful because it’s always awful. No one loves their deployment process. What if launching new features didn’t require you to do a full-on code and possibly infrastructure deploy? What if you could test on a small subset of users and then roll it back immediately if results aren’t what you expect? LaunchDarkly does exactly this. To learn more, visit launchdarkly.com and tell them Corey sent you, and watch for the wince.

Corey: Now, from the mouth of the AWS horse itself, “Generate logon messages for security and compliance in Amazon WorkSpaces.” For compliance, sure. For security, can you name a single security benefit to having a logon message greet users? “It reminds them that—” Yeah, yeah, nobody reads the popup ever again after the first time, and not always the first time either. Security is important—and fatiguing your users into not reading pop-up messages that don’t respect their time is a great way to teach them to ignore you. Don’t do it.

“Ransomware mitigation: Using Amazon WorkDocs to protect end-user data”. Security through obscurity has been thoroughly debunked by security professionals everywhere, but I still can’t help but think that WorkDocs is so narrowly deployed in the industry that it’s never really caught the attention of bad actors.

And “CVE-2022-0778 awareness”. Cross-account access between their customers, AWS is largely silent about, but an OpenSSL issue, “In which a certificate containing invalid explicit curve parameters can cause a Denial of Service (DoS) by triggering an infinite logic loop” is clearly Not Their Fault, so of course, this is the thing that gets a rather rare security bulletin from them. Of course, as of the time of recording this, it hadn’t been updated past an initial ‘we’re aware of the issue.’
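If you're wondering whether your own runtime links a patched OpenSSL (the upstream advisory lists 1.1.1n and 3.0.2 as the fixed releases), Python will tell you in one line:

    import ssl

    # Prints the OpenSSL build the interpreter is linked against,
    # e.g. "OpenSSL 1.1.1n  15 Mar 2022".
    print(ssl.OPENSSL_VERSION)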

And in the world of tools, ElectricEye is a set of Python scripts—affectionately called Auditors—that continuously monitor your AWS infrastructure looking for configurations related to confidentiality, integrity, and availability that align, or don’t align—the other way—with AWS best practices. The fact that it’s open-source and free is eyebrow-raising because usually things that do this cost thousands and thousands of dollars. ElectricEye instead leaves that part to AWS Security Hub itself. And that’s what happened last week in the wide world of AWS. I’m Corey Quinn, thanks for listening.

Corey: Thank you for listening to the AWS Morning Brief: Security Edition with the latest in AWS security that actually matters. Please follow AWS Morning Brief on Apple Podcast, Spotify, Overcast—or wherever the hell it is you find the dulcet tones of my voice—and be sure to sign up for the Last Week in AWS newsletter at lastweekinaws.com.

Announcer: This has been a HumblePod production. Stay humble.

Thu, 31 Mar 2022 03:00:00 -0700
S3 Is Not a Backup

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/s3-is-not-a-backup




Never miss an episode



Help the show



What's Corey up to?

Wed, 30 Mar 2022 07:30:00 -0700
Speaking to the Dead with Amazon Chime
AWS Morning Brief for the week of March 28, 2022 with Corey Quinn.
Mon, 28 Mar 2022 03:00:00 -0700
Is Okta Gone?

Links Referenced:


Transcript

Corey: This is the AWS Morning Brief: Security Edition. AWS is fond of saying security is job zero. That means it’s nobody in particular’s job, which means it falls to the rest of us. Just the news you need to know, none of the fluff.

Corey: This episode is sponsored in part by our friends at Sysdig. Sysdig is the solution for securing DevOps. They have a blog post that went up recently about how an insecure AWS Lambda function could be used as a pivot point to get access into your environment. They’ve also gone deep in-depth with a bunch of other approaches to how DevOps and security are inextricably linked. To learn more, visit sysdig.com and tell them I sent you. That’s S-Y-S-D-I-G dot com. My thanks to them for their continued support of this ridiculous nonsense.

Corey: Last week AWS quietly updated the re:Inforce site to reflect that instead of Houston, their security conference, held ideally annually, would be taking place this July in Boston. Given that Texas’s leadership has been doing what appears to be its level best to ensure that respectable businesses don’t want to do business there, this is an incredible logistical, and frankly moral, feat that AWS has pulled off.

Corey: That’s the good news. The bad news of course is as this issue went to print, the news coming out of Okta about a breach remains disturbingly murky. I’m trying here to provide the best take rather than the first take, so I really hope someone’s going to have better data for me by next week. Oof. Condolences to everyone who is affected.



Yeah, other than that, from the security community, a while back I had a bit of a conniption fit about how RDS doesn’t mandate SSL/TLS connections. For a company whose CTO’s tagline and t-shirt both read “Encrypt Everything” this strikes me as… discordant. A blog post I stumbled over goes into far greater detail about what exactly requires encryption and what doesn’t. Make sure your stuff is actually secure when you think it is; that’s the takeaway here. Verify these things or other people will be thrilled to do so for you, but you won’t like it very much.
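For the PostgreSQL flavor of RDS, refusing plaintext connections is one parameter away; a minimal boto3 sketch, with a placeholder parameter group name (MySQL and MariaDB use require_secure_transport instead):

    import boto3

    rds = boto3.client("rds")

    # Reject non-TLS connections at the engine level. "my-pg-params" is
    # a placeholder; the group must be attached to the instance to matter.
    rds.modify_db_parameter_group(
        DBParameterGroupName="my-pg-params",
        Parameters=[{
            "ParameterName": "rds.force_ssl",
            "ParameterValue": "1",
            "ApplyMethod": "immediate",  # rds.force_ssl is dynamic on Postgres
        }],
    )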

Corey: Couchbase Capella Database-as-a-Service is flexible, full-featured, and fully managed with built-in access via key-value, SQL, and full-text search. Flexible JSON documents aligned to your applications and workloads. Build faster with blazing fast in-memory performance and automated replication and scaling while reducing cost. Capella has the best price-performance of any fully managed document database. Visit couchbase.com/screaminginthecloud to try Capella today for free and be up and running in three minutes with no credit card required. Couchbase Capella: Make your data sing.

Corey: AWS had one notable security announcement that didn’t come from their security blog. AWS Lambda announces support for PrincipalOrgID in resource-based policies. Now, that’s a fancy way to say, “All of the resources within my AWS organization can talk to this Lambda Function,” which in common parlance is generally historically expressed as just granting access to the world and hoping people don’t stumble across it. I like this new way significantly more; you should too.
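To make that concrete, here's a sketch of granting org-wide invoke on a function via boto3; the function name, statement ID, and organization ID are all placeholders:

    import boto3

    lam = boto3.client("lambda")

    # Grant invoke to every principal in one AWS Organization instead of
    # the world. All identifiers below are placeholders.
    lam.add_permission(
        FunctionName="my-function",
        StatementId="allow-my-org",
        Action="lambda:InvokeFunction",
        Principal="*",                  # scoped down by the org condition
        PrincipalOrgID="o-a1b2c3d4e5",  # only principals in this org may invoke
    )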

And from the world of tools, I found two of interest. Hopefully, folks aren’t going to need this, but AWS Labs has an Automated Incident Response and Forensics Framework that helps you not do completely wrong things in the midst of a security incident. It’s worth reviewing if for no other reason than the discussions it’s likely to spark. Because security has always been more about people than tools. Occasionally it’s about people who are tools, but that’s just uncharitable, so let’s be kinder.

This CI/CDon’t tool is awesome; it intentionally deploys vulnerable software or infrastructure to your AWS account so you can practice exploiting it. I’m a sucker for scenario-based learning tools like this one, so I have a sneaking suspicion maybe some of you might be, too. And that’s what happened last week in AWS security. Thank you for listening. I’m Cloud Economist Corey Quinn. Ugh, this week is almost over.

Corey: Thank you for listening to the AWS Morning Brief: Security Edition with the latest in AWS security that actually matters. Please follow AWS Morning Brief on Apple Podcast, Spotify, Overcast—or wherever the hell it is you find the dulcet tones of my voice—and be sure to sign up for the Last Week in AWS newsletter at lastweekinaws.com.

Announcer: This has been a HumblePod production. Stay humble.

Thu, 24 Mar 2022 03:00:00 -0700
Google Cloud Alters the Deal

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/google-cloud-alters-the-deal



Never miss an episode



Help the show



What's Corey up to?

Wed, 23 Mar 2022 07:30:00 -0700
Conducting the AWS Billing Train
AWS Morning Brief for the week of March 21, 2022 with Corey Quinn.
Mon, 21 Mar 2022 03:00:00 -0700
The Surprise Mandoogle

Links:


Transcript

Corey: This is the AWS Morning Brief: Security Edition. AWS is fond of saying security is job zero. That means it’s nobody in particular’s job, which means it falls to the rest of us. Just the news you need to know, none of the fluff.

Corey: Couchbase Capella Database-as-a-Service is flexible, full-featured, and fully managed with built-in access via key-value, SQL, and full-text search. Flexible JSON documents aligned to your applications and workloads. Build faster with blazing fast in-memory performance and automated replication and scaling while reducing cost. Capella has the best price performance of any fully managed document database. Visit couchbase.com/screaminginthecloud to try Capella today for free and be up and running in three minutes with no credit card required. Couchbase Capella: Make your data sing.

Hello and welcome to Last Week in AWS Security. A lot has happened; let’s tear into it.

So, there was a “Sort of yes, sort of no” security issue with CodeBuild that I’ve talked about previously. The blog post I referenced has, in fact, been updated. AWS has stated that, “We have updated the CodeBuild service to block all outbound network access for newly created CodeBuild projects which contain a customer-defined VPC configuration,” which indeed closes the gap. I love happy endings.

On the other side, oof. Orca Security found a particularly nasty Azure breach called AutoWarp. You effectively could get credentials for other tenants by simply asking a high port on localhost for them via curl or netcat. This is bad enough; I’m dreading the AWS equivalent breach in another four months of them stonewalling a security researcher if the previous round of their nonsense silence about security patterns is any indicator.

“Google Announces Intent to Acquire Mandiant”. This is a big deal. Mandiant has been a notable center of excellent cybersecurity talent for a long time. Congratulations or condolences to any Mandoogles in the audience. Please let me know how the transition goes for you.

Hive Systems has updated its password table for 2022, which is just a graphic that shows how long passwords of various levels of length and complexity would take to break on modern systems. The takeaway here is to use long passwords and use a password manager.
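The arithmetic behind tables like that one is worth sanity-checking yourself; a back-of-the-envelope sketch, where the guess rate is my assumption rather than a Hive Systems figure:

    def crack_time_years(length, alphabet, guesses_per_sec=1e10):
        # Worst case: the attacker tries the entire keyspace; the average
        # successful guess lands at half this.
        keyspace = alphabet ** length
        return keyspace / guesses_per_sec / (3600 * 24 * 365)

    print(crack_time_years(8, 26))    # 8 lowercase letters: ~21 seconds of keyspace
    print(crack_time_years(16, 94))   # 16 printable characters: ~1e14 years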

Corey: You know the drill: You’re just barely falling asleep and you’re jolted awake by an emergency page. That’s right, it’s your night on call, and this is the bad kind of Call of Duty. The good news is, is that you’ve got New Relic, so you can quickly run down the incident checklist and find the problem. You have an errors inbox that tells you that Lambdas are good, RUM is good, but something’s up in APM. So, you click the error and find the deployment marker where it all began. Dig deeper, there’s another set of errors. What is it? Of course, it’s Kubernetes, starting after an update. You ask that team to roll back and bam, problem solved. That’s the value of combining 16 different monitoring products into a single platform: You can pinpoint issues down to the line of code quickly. That’s why the Dev and Ops teams at DoorDash, GitHub, Epic Games, and more than 14,000 other companies use New Relic. The next late-night call is just waiting to happen, so get New Relic before it starts. And you can get access to the whole New Relic platform at 100 gigabytes of data free, forever, with no credit card. Visit newrelic.com/morningbrief that’s newrelic.com/morningbrief.

And of course, another week, another terrifying security concern. This one is called DirtyPipe. It’s in the Linux kernel, and the name is evocative of something you’d expect to see demoed onstage at re:Invent.



Now, what did AWS have to say? Two things. The first is “Manage AWS resources in your Slack channels with AWS Chatbot”. A helpful reminder that it’s important to restrict access to your AWS production environment down to just the folks at your company who need access to it. Oh, and to whoever over at Slack can access your Slack workspace, apparently. We don’t talk about that one very much, now do we?

And the second was, “How to set up federated single-sign-on to AWS using Google Workspace”. This is super-aligned with what I want to do, but something about the way that it’s described makes it sound mind-numbingly complicated. This isn’t a problem that’s specific to this post or even to AWS; it’s industry-wide when it comes to SSO. I’m starting to think that maybe I’m the problem here.

And lastly, AWS has open-sourced a tool called Cloudsaga, designed to simulate security events in AWS. This may be better known as, “Testing out your security software,” and with sufficiently poor communication, “Giving your CISO a heart attack.”

And that’s what happened last week in AWS security. If you’ve enjoyed it, please tell your friends about this place. I’ll talk to you next week.



Corey: Thank you for listening to the AWS Morning Brief: Security Edition with the latest in AWS security that actually matters. Please follow AWS Morning Brief on Apple Podcast, Spotify, Overcast—or wherever the hell it is you find the dulcet tones of my voice—and be sure to sign up for the Last Week in AWS newsletter at lastweekinaws.com.

Announcer: This has been a HumblePod production. Stay humble.

Thu, 17 Mar 2022 03:00:00 -0700
My Mental Model of AWS Regions

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/my-mental-model-of-aws-regions



Never miss an episode



Help the show



What's Corey up to?

Wed, 16 Mar 2022 07:30:00 -0700
The 20-for-1 AWS Container Services Split
AWS Morning Brief for the week of March 14, 2022 with Corey Quinn.
Mon, 14 Mar 2022 03:00:00 -0700
Collecting Evidence for the Prosecution

Links:


Transcript

Corey: This is the AWS Morning Brief: Security Edition. AWS is fond of saying security is job zero. That means it’s nobody in particular’s job, which means it falls to the rest of us. Just the news you need to know, none of the fluff.

Corey: This episode is sponsored in part by our friends at Sysdig. Sysdig is the solution for securing DevOps. They have a blog post that went up recently about how an insecure AWS Lambda function could be used as a pivot point to get access into your environment. They’ve also gone deep in-depth with a bunch of other approaches to how DevOps and security are inextricably linked. To learn more, visit sysdig.com and tell them I sent you. That’s S-Y-S-D-I-G dot com. My thanks to them for their continued support of this ridiculous nonsense.

Corey: Well, oops. Last week in the newsletter version of this podcast I used the wrong description for a link. On the plus side, I do find myself wondering if anyone hunts down the things I talk about on this podcast and the newsletter I send out, and now I know an awful lot of you do. And you have opinions about the correctness of my links. The actual tech company roundup that I linked to last week was, in fact, not an AWS blog post about QuickSight community—two words that are an oxymoron if ever two were—but instead a roundup in The Register. My apologies for the oversight. Now, let’s dive into what happened last week in the wide world of AWS security.

In my darker moments, I find myself asking a very blunt question: “WTF is Cloud Native Data Security?” I confess it never occurred to me to title a blog post with that question, and this article I found with that exact title is in fact one of the better ones I’ve read in recent days. Check it out if the subject matter appeals to you even slightly because you’re in for a treat. There’s a lot to unpack here.

Scott Piper has made good on his threat to publish an IMDSv2 wall of shame. So far, two companies have been removed from the list for improving their products’ security posture—I know, it’s never happened before—but this is why we care about these things. It’s not to make fun of folks; it’s to make this industry better than it was.
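The enforcement side of this (for your own instances, as opposed to the lagging vendors) is a single API call; a boto3 sketch with a placeholder instance ID:

    import boto3

    ec2 = boto3.client("ec2")

    # Require IMDSv2's session tokens; IMDSv1's unauthenticated GETs stop
    # working on this instance. The instance ID is a placeholder.
    ec2.modify_instance_metadata_options(
        InstanceId="i-0123456789abcdef0",
        HttpTokens="required",
        HttpEndpoint="enabled",
    )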

A while back I talked about various cloud WAFs—most notably AWS’s—having a fun and in-hindsight-obvious flaw: anything above 8KB just sort of dances through the protective layer. Well, even Google and its, frankly, impressive security apparatus isn’t immune. There’s an article called “Piercing the Cloud Armor” that goes into it. This stuff is hard, but honestly, this is kind of a recurring problem. I’m sort of wondering, “Well, what if we make the packet bigger?” Wasn’t that the whole problem with the Ping of Death, back in the ’80s? Why is that still a thing now?
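To make the failure mode concrete, here's a hypothetical sketch of the padding trick; the URL and parameter names are invented, and obviously don't point this at infrastructure you don't own:

    import requests  # third-party: pip install requests

    # A WAF that inspects only the first N bytes of a request body never
    # sees anything that arrives after the filler. Illustration only.
    padding = {"filler": "A" * 8192}    # pushes the rest past an 8 KB window
    payload = {"q": "' OR '1'='1"}      # the string the WAF was meant to catch

    requests.post("https://example.com/search", data={**padding, **payload})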

Corey: This episode is sponsored in part by LaunchDarkly. Take a look at what it takes to get your code into production. I’m going to just guess that it’s awful because it’s always awful. No one loves their deployment process. What if launching new features didn’t require you to do a full-on code and possibly infrastructure deploy? What if you could test on a small subset of users and then roll it back immediately if results aren’t what you expect? LaunchDarkly does exactly this. To learn more, visit launchdarkly.com and tell them Corey sent you, and watch for the wince.

And of course, a now patched vulnerability in Amazon Alexa meant that the speaker could activate itself. Because it’s a security problem with an Amazon product that I’ve paid for, I of course learn about this via a third-party talking about it. Man, my perspective on Amazon’s security messaging as a whole has gone from glowing to in the toilet remarkably quickly this year. And it’s their own damn fault.



Now, AWS had a single post of note here called “Streamlining evidence collection with AWS Audit Manager”. This post slash quote-unquote “Solution” highlights a concern that’s often overlooked by security folks. It very innocently talks about collecting evidence for an audit, which is perfectly reasonable.



You need evidence that your audit controls are being complied with. Now, picture someone walking past a room where you’re talking about this, and all they hear is “Evidence collection.” Maybe they’re going to feel like there’s more going on here than an audit. Perhaps they’re going to let their guilty conscience—and I assure you, everyone has one—run wild with fears that whatever imagined transgression they’ve committed has been discovered? Remember the human.

And of course, I found two tools in the open-source universe that might be of interest to folks. The first: AWS has open-sourced a security assessment solution that uses Prowler and ScoutSuite to scan your environment. It’s handy, but I’m having a hell of a hard time reconciling its self-described ‘inexpensive’ with ‘it deploys a Managed NAT Gateway.’

And Domain Protect—an open-source project with a surprisingly durable user interface—scans for dangling DNS entries to validate that you’re not, y’know, leaving a domain of yours open to exploit. You’re going to want to pay attention to this vector, but we haven’t for 15 years, so why would we start now? And that’s what happened last week in the wide world of AWS security. I am Cloud Economist Corey Quinn. Thank you for listening. There’s always more yet to come.

Corey: Thank you for listening to the AWS Morning Brief: Security Edition with the latest in AWS security that actually matters. Please follow AWS Morning Brief on Apple Podcast, Spotify, Overcast—or wherever the hell it is you find the dulcet tones of my voice—and be sure to sign up for the Last Week in AWS newsletter at lastweekinaws.com.

Announcer: This has been a HumblePod production. Stay humble.

Thu, 10 Mar 2022 03:00:00 -0800
Handling Secrets with AWS

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/handling-secrets-with-aws



Never miss an episode



Help the show



What's Corey up to?

Wed, 09 Mar 2022 07:30:00 -0800
Unnamed Podcast That Informs and Snarks about AWS News
AWS Morning Brief for the week of March 7, 2022 with Corey Quinn.
Mon, 07 Mar 2022 03:00:00 -0800
Corporate Solidarity

Links:

Transcript

Corey: This is the AWS Morning Brief: Security Edition. AWS is fond of saying security is job zero. That means it’s nobody in particular’s job, which means it falls to the rest of us. Just the news you need to know, none of the fluff.

Corey: Couchbase Capella Database-as-a-Service is flexible, full-featured, and fully managed with built-in access via key-value, SQL, and full-text search. Flexible JSON documents aligned to your applications and workloads. Build faster with blazing fast in-memory performance and automated replication and scaling while reducing cost. Capella has the best price performance of any fully managed document database. Visit couchbase.com/screaminginthecloud to try Capella today for free and be up and running in three minutes with no credit card required. Couchbase Capella: Make your data sing.

Corey: We begin with a yikes because suddenly the world is aflame and of course there are cybersecurity considerations to that. I’m going to have more on that to come in future weeks because my goal with this podcast is to have considered takes, not the rapid-response, alarmist, the-world-is-ending ones. There are lots of other places to find those. So, more to come on that.

In happier news, your favorite Cloud Economist was quoted in the Wall Street Journal last week, talking about how staggering Microsoft’s security surface really is. And credit where due, it’s hard to imagine a better person for the role than Charlie Bell. He’s going to either fix a number of systemic problems at Azure or else carve his resignation letter into Satya Nadella’s door with an axe. I really have a hard time envisioning a third outcome.

A relatively light week aside from that. The Register has a decent roundup of how various companies are responding to Russia’s invasion of a sovereign country. Honestly, the solidarity among those companies is kind of breathtaking. I didn’t have that on my bingo card for the year.

Corey: You know the drill: You’re just barely falling asleep and you’re jolted awake by an emergency page. That’s right, it’s your night on call, and this is the bad kind of Call of Duty. The good news is, is that you’ve got New Relic, so you can quickly run down the incident checklist and find the problem. You have an errors inbox that tells you that Lambdas are good, RUM is good, but something’s up in APM. So, you click the error and find the deployment marker where it all began. Dig deeper, there’s another set of errors. What is it? Of course, it’s Kubernetes, starting after an update. You ask that team to roll back and bam, problem solved. That’s the value of combining 16 different monitoring products into a single platform: You can pinpoint issues down to the line of code quickly. That’s why the Dev and Ops teams at DoorDash, GitHub, Epic Games, and more than 14,000 other companies use New Relic. The next late-night call is just waiting to happen, so get New Relic before it starts. And you can get access to the whole New Relic platform at 100 gigabytes of data free, forever, with no credit card. Visit newrelic.com/morningbrief that’s newrelic.com/morningbrief.



Corey: If you expose 200GB of data, it’s bad. If that data belongs to customers, it’s worse. If a lot of those customers are themselves children, it’s awful. But if you ignore reports about the issue, leave the bucket open, and only secure it after your government investigates you for ignoring it under the GDPR, you are this week’s S3 Bucket Negligence Award winner and should probably be fired immediately.

AWS had a single announcement of note last week: “Fine-tune and optimize AWS WAF Bot Control mitigation capability”. It’s super important because, with WAF and Bot Control, the failure mode in one direction is that bots overwhelm your site. The failure mode in the other direction is that you start blocking legitimate traffic. And the worst failure mode is that both of these happen at the same time.

And a new tool I’m kicking the tires on: Granted. It’s apparently another way of logging into a bunch of different AWS accounts, which appeals to me because I consistently have problems with that exact thing. And that’s what happened last week in AWS security which, let’s be clear, is not the most important area of the world to be focusing on right now. Thanks for listening; I’ll talk to you next week.

Corey: Thank you for listening to the AWS Morning Brief: Security Edition with the latest in AWS security that actually matters. Please follow AWS Morning Brief on Apple Podcast, Spotify, Overcast—or wherever the hell it is you find the dulcet tones of my voice—and be sure to sign up for the Last Week in AWS newsletter at lastweekinaws.com.

Announcer: This has been a HumblePod production. Stay humble.

Thu, 03 Mar 2022 03:00:00 -0800
Status Paging You

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/status-paging-you



Never miss an episode



Help the show



What's Corey up to?

Wed, 02 Mar 2022 07:30:00 -0800
Your AWS S3 Bill is Backup
AWS Morning Brief for the week of February 28, 2022 with Corey Quinn.
Mon, 28 Feb 2022 03:00:00 -0800
Security Developer Experience and Security

Links:


Transcript

Corey: This is the AWS Morning Brief: Security Edition. AWS is fond of saying security is job zero. That means it’s nobody in particular’s job, which means it falls to the rest of us. Just the news you need to know, none of the fluff.

Corey: This episode is sponsored in part by our friends at Sysdig. Sysdig is the solution for securing DevOps. They have a blog post that went up recently about how an insecure AWS Lambda function could be used as a pivot point to get access into your environment. They’ve also gone deep in-depth with a bunch of other approaches to how DevOps and security are inextricably linked. To learn more, visit sysdig.com and tell them I sent you. That’s S-Y-S-D-I-G dot com. My thanks to them for their continued support of this ridiculous nonsense.



Corey: Somehow a week without an S3 Bucket Negligence Award to pass out for anyone. I really hope I’m not tempting fate by pointing that out, but good work, everyone.

So, from the community. Redmonk’s Rachel Stephens once again hits the nail on the head with her post, “Developer Experience is Security”. I don’t believe it’s a coincidence that for a while now I’ve thought that Google Cloud offers not only the best developer experience of the hyperscale clouds but also the best security. I didn’t come to that conclusion lightly.



Also, now that the professional football season is over, the San Francisco 49ers eagerly turn to their off-season task of cleansing their network of ransomware. Ouch. Not generally a great thing when you find that your organization has been compromised and you can’t access any of your data.



Now, AWS had a couple of interesting things out there. “Control access to Amazon Elastic Container Service resources by using ABAC policies”. I was honestly expecting there to be a lot more stories by now of improper tagging being used to gain access via ABAC. The problem here is that for the longest time tagging was at best a billing metadata construct; it made sense to have everything be able to tag itself. Suddenly, with the advent of attribute-based access control, anything that can tag resources now becomes a security challenge.
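
For those who haven’t played with ABAC, the pattern looks roughly like this: a policy that only allows an action when the caller’s tag matches the resource’s tag. A minimal sketch with a hypothetical ‘team’ tag key (the AWS post has the full ECS walkthrough):

```python
# An ABAC-style policy: ECS service changes are allowed only when the
# resource's "team" tag equals the calling principal's "team" tag.
import json

abac_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["ecs:UpdateService", "ecs:DeleteService"],
        "Resource": "*",
        "Condition": {
            "StringEquals": {
                "aws:ResourceTag/team": "${aws:PrincipalTag/team}"
            }
        },
    }],
}
print(json.dumps(abac_policy, indent=2))
```

The flip side is exactly the problem described above: anything that can write that team tag, on resources or on itself, can quietly rewrite its own effective permissions.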

Corey: This episode is sponsored in part by LaunchDarkly. Take a look at what it takes to get your code into production. I’m going to just guess that it’s awful because it’s always awful. No one loves their deployment process. What if launching new features didn’t require you to do a full-on code and possibly infrastructure deploy? What if you could test on a small subset of users and then roll it back immediately if results aren’t what you expect? LaunchDarkly does exactly this. To learn more, visit launchdarkly.com and tell them Corey sent you, and watch for the wince.

“Introducing s2n-quic—‘sin-i-quick?’ ‘sin-two-quick?’ Yeah—a new open-source QUIC protocol implementation in Rust”. Now, with a name like that, you know it came out of AWS. This is a bit in the weeds for most of us, but the overall lesson to take from the release-slash-announcement is, “Don’t roll your own cryptographic implementation,” with the obvious exception case of, “Unless you are AWS.”



“Top 2021 AWS Security service launches security professionals should review–Part 1”. Okay, this summary post highlights an issue with how AWS talks about things. Every last one of these enhancements is a feature added to an existing service. Some of those refinements are helpful; others simply add unneeded complexity to a given customer’s use case. This feels a lot more like a comprehensive listing than it does a curated selection, but maybe that’s just me.

And lastly, I stumbled over a tool called Ghostbuster, which is surprisingly easy to use. It scans your DNS records and finds dangling Elastic IPs that can be misused for a variety of different purposes, none of which are going to benefit you directly. It’s been a while since I found a new tool this straightforward and simple to use. Good work; a rough sketch of the check it automates follows below. And that’s what happened last week in AWS security. I’m Corey Quinn. Thanks for listening.
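
For the curious, the core of what Ghostbuster automates can be approximated in a few boto3 calls: gather the public IPs your account actually owns, then flag DNS records pointing elsewhere. This naive sketch skips pagination and will flag legitimate third-party hosts too, so treat it as an illustration of the idea rather than a replacement for the tool:

```python
# Rough sketch of a dangling-Elastic-IP check: any Route 53 A record that
# points at an IP this account no longer owns is worth a closer look.
import boto3

r53 = boto3.client("route53")
ec2 = boto3.client("ec2")

owned = {a["PublicIp"] for a in ec2.describe_addresses()["Addresses"]}
owned |= {
    inst["PublicIpAddress"]
    for res in ec2.describe_instances()["Reservations"]
    for inst in res["Instances"]
    if inst.get("PublicIpAddress")
}

for zone in r53.list_hosted_zones()["HostedZones"]:
    rrsets = r53.list_resource_record_sets(HostedZoneId=zone["Id"])
    for rrset in rrsets["ResourceRecordSets"]:
        if rrset["Type"] != "A":
            continue
        for record in rrset.get("ResourceRecords", []):
            if record["Value"] not in owned:
                print(f"possibly dangling: {rrset['Name']} -> {record['Value']}")
```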

Corey: Thank you for listening to the AWS Morning Brief: Security Edition with the latest in AWS security that actually matters. Please follow AWS Morning Brief on Apple Podcast, Spotify, Overcast—or wherever the hell it is you find the dulcet tones of my voice—and be sure to sign up for the Last Week in AWS newsletter at lastweekinaws.com.

Announcer: This has been a HumblePod production. Stay humble.

Thu, 24 Feb 2022 03:00:00 -0800
The Trials and Travails of AWS SSO

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/the-trials-and-travails-of-aws-sso/



Never miss an episode



Help the show



What's Corey up to?

Wed, 23 Feb 2022 03:06:44 -0800
AWS Bill Goes Brrrrrrrrrrrrrrr
AWS Morning Brief for the week of February 20, 2022 with Corey Quinn.
Mon, 21 Feb 2022 03:00:00 -0800
Of CORS It Gets Better

Links Referenced:


Transcript

Corey: This is the AWS Morning Brief: Security Edition. AWS is fond of saying security is job zero. That means it’s nobody in particular’s job, which means it falls to the rest of us. Just the news you need to know, none of the fluff.

Corey: This episode is sponsored in part by our friends at Sysdig. Sysdig is the solution for securing DevOps. They have a blog post that went up recently about how an insecure AWS Lambda function could be used as a pivot point to get access into your environment. They’ve also gone deep in-depth with a bunch of other approaches to how DevOps and security are inextricably linked. To learn more, visit sysdig.com and tell them I sent you. That’s S-Y-S-D-I-G dot com. My thanks to them for their continued support of this ridiculous nonsense.

Corey: So, last week was fairly tame and—no. I’m not going to say that because the last time I said that, all hell broke loose with Log4J and I can’t go through that again.

So, let’s see what happened last week in AWS Security. I like this one very much. Thinkst Canary provides, for free via CanaryTokens.org, an AWS credential generator that spits out IAM credentials with no permissions. The single thing those credentials do is scream bloody murder if someone attempts to use them, because that means they’ve been stolen. There are some sneaky ways to avoid having the testing of those tokens show up in CloudTrail logs, but Thinkst has just found a solid way to defeat that sneaky method. It’s worth digging into.
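
The deployment story really is that simple: canarytokens.org hands you a key pair, and you leave it somewhere an attacker would find tempting. A minimal sketch (the key material below is an obvious placeholder, not a real token):

```python
# Plant a canary AWS profile in the default credentials file. Any attempt
# to actually use this profile fires the Thinkst alert.
from pathlib import Path

DECOY_PROFILE = """
[prod-admin]
aws_access_key_id = AKIAEXAMPLECANARY000
aws_secret_access_key = examplecanarysecret0examplecanarysecret0
"""

creds = Path.home() / ".aws" / "credentials"
creds.parent.mkdir(exist_ok=True)
with creds.open("a") as handle:
    handle.write(DECOY_PROFILE)
print(f"decoy planted in {creds}")
```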

I’ve been a fan of Oracle Cloud for a while, which has attracted some small amount of controversy. I stand by my opinion. That said, there’s been some debate over whether they’re a viable cloud provider at scale. There are certain things I look for as indicators that a cloud provider is a serious contender, and one of them has just been reached: the folks at Orca found a Server-Side Request Forgery (SSRF) vulnerability around OCI’s handling of instance metadata. It sounds like I’m kidding here, but I’m not. When third-party researchers find a vulnerability that is non-obvious to most of us, that’s an indication that real companies are using services built on top of the platform. Onward.

A donation site raising funds for the Ottawa truckers’ convoy nonsense that’s been going on scored itself an S3 Bucket Negligence Award. No matter how much I may dislike an organization or its policies, I maintain that cybersecurity needs to be available to all.

Corey: You know the drill: you’re just barely falling asleep and you’re jolted awake by an emergency page. That’s right, it’s your night on call, and this is the bad kind of Call of Duty. The good news is, is that you’ve got New Relic, so you can quickly run down the incident checklist and find the problem. You have an errors inbox that tells you that Lambdas are good, RUM is good, but something’s up in APM. So, you click the error and find the deployment marker where it all began. Dig deeper, there’s another set of errors. What is it? Of course, it’s Kubernetes, starting after an update. You ask that team to roll back and bam, problem solved. That’s the value of combining 16 different monitoring products into a single platform: you can pinpoint issues down to the line of code quickly. That’s why the Dev and Ops teams at DoorDash, GitHub, Epic Games, and more than 14,000 other companies use New Relic. The next late-night call is just waiting to happen, so get New Relic before it starts. And you can get access to the whole New Relic platform at 100 gigabytes of data free, forever, with no credit card. Visit newrelic.com/morningbrief that’s newrelic.com/morningbrief.

I knew MFA adoption was struggling among consumers, but I was stunned by Microsoft’s statement that only 22% of enterprise customers have adopted an additional security factor. Please, if you haven’t enabled MFA in your important accounts—and yes, your cloud provider is one of those—please go ahead and do it now.

An interesting security advancement over in the land of Google Cloud: they’ve modified their hypervisor to detect cryptocurrency mining without needing an agent inside of the VM. This beats my usual method of ‘looking for instances with lots of CPU usage because most of the time the fleet is bored.’

Over in AWS-land, they didn’t have anything particularly noteworthy that came out last week for security, so I want to talk a little bit about a service that gets too little love: AWS CloudTrail. Think of it as an audit log for all of the management events that happen in your AWS account. You’re going to want to secure where the logs live, ideally in a separate account within your AWS organization. To AWS’s credit, they made the first management trail free a few years ago and enabled it across all accounts by default as a result. This is going to help someone out there, I suspect. Remember, if you haven’t heard about it before, it’s new to you.
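
If you want more than the free default trail, the setup is a handful of API calls. A minimal sketch, assuming you’re in the organization’s management account and the destination bucket (a hypothetical name) already has a bucket policy permitting CloudTrail writes:

```python
# Create an organization-wide, multi-region management trail that ships
# logs to a bucket owned by a separate log-archive account.
import boto3

cloudtrail = boto3.client("cloudtrail")

cloudtrail.create_trail(
    Name="org-management-events",        # hypothetical trail name
    S3BucketName="example-log-archive",  # bucket in the log-archive account
    IsMultiRegionTrail=True,
    IsOrganizationTrail=True,            # capture every member account
    EnableLogFileValidation=True,        # tamper-evident digest files
)
cloudtrail.start_logging(Name="org-management-events")
```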

And I found a fun tool that’s just transformative because if the bully who beat you up and stole your lunch money in middle school were a technology, they would undoubtedly be CORS, or ‘Cross-Origin Resource Sharing.’ The Amazon API Gateway CORS Configurator tool helps you make it work with API Gateway, and I love this so much. And that’s what happened last week in AWS security. Thanks for listening.

Corey: Thank you for listening to the AWS Morning Brief: Security Edition with the latest in AWS security that actually matters. Please follow AWS Morning Brief on Apple Podcast, Spotify, Overcast—or wherever the hell it is you find the dulcet tones of my voice—and be sure to sign up for the Last Week in AWS newsletter at lastweekinaws.com.

Announcer: This has been a HumblePod production. Stay humble.

Thu, 17 Feb 2022 03:00:00 -0800
Are AWS Account IDs Sensitive Information?

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/are-aws-account-ids-sensitive-information/



Never miss an episode



Help the show



What's Corey up to?

Wed, 16 Feb 2022 03:00:00 -0800
A Billing Glimpse and a CloudFormation Hook
AWS Morning Brief for the week of February 14, 2021 with Corey Quinn.
Mon, 14 Feb 2022 03:00:00 -0800
VPC Data Exfiltration Via CodeBuild

Links:


Transcript

Corey: This is the AWS Morning Brief: Security Edition. AWS is fond of saying security is job zero. That means it’s nobody in particular’s job, which means it falls to the rest of us. Just the news you need to know, none of the fluff.

Corey: This episode is sponsored in part by our friends at Sysdig. Sysdig is the solution for securing DevOps. They have a blog post that went up recently about how an insecure AWS Lambda function could be used as a pivot point to get access into your environment. They’ve also gone deep in-depth with a bunch of other approaches to how DevOps and security are inextricably linked. To learn more, visit sysdig.com and tell them I sent you. That’s S-Y-S-D-I-G dot com. My thanks to them for their continued support of this ridiculous nonsense.



Corey: Hello there. Another week, another erosion of the perception of AWS’s hard security boundaries. I don’t like what 2022 is doing to my opinion of AWS’s security track record. Let’s get into it.

We start this week with a rather disturbing post from Aidan Steele, who talks about using CodeBuild to exfiltrate data from an AWS VPC. We’re seeing ever-increasing VPC complexity, which in turn means that most of us don’t have a full understanding of where the security boundaries and guarantees lie.

Someone decided to scan a bunch of public AWS IP ranges and, lo and behold, an awful lot of us suck at security. Specifically, they found Thousands of Open Databases. This is clearly not an exclusively AWS problem, seeing as it falls squarely on the customer side of the Shared Responsibility Model, but it does have the potential to be interpreted otherwise by folks with a less nuanced understanding.

Mark Nunnikhoven has a blog post up that asks the question “Why do Amazon S3 Data Breaches Keep Happening?” I’ve often wondered the same thing. The vector has been known for years, the console screams at you if you attempt to configure things this way, and at this point, there’s really little excuse for a customer making these mistakes. And yet they keep happening.

Scott Piper has had enough. He’s issued a simple warning: If you’re a vendor who offers a solution that deploys EC2 instances to customer environments, and you don’t support IMDSv2, you’re going to be placed on a public list of shame. He’s right: His first shame example is AWS themselves with a new feature release. For those who aren’t aware of what IMDSv2 is, it’s the instance metadata service. Ideally, you have to authenticate against that thing before just grabbing data off of it. This is partially how Capital One wound up getting smacked a couple years back.
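
Mechanically, the IMDSv2 ‘authentication’ is a session token you fetch with a PUT and then present on every read, which is exactly what garden-variety SSRF payloads (simple GETs) can’t do. A quick sketch of the dance, runnable on an EC2 instance:

```python
# IMDSv2: grab a short-lived session token first, then present it as a
# header on every metadata request.
import requests

IMDS = "http://169.254.169.254"

token = requests.put(
    f"{IMDS}/latest/api/token",
    headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"},
    timeout=2,
).text

role = requests.get(
    f"{IMDS}/latest/meta-data/iam/security-credentials/",
    headers={"X-aws-ec2-metadata-token": token},
    timeout=2,
).text
print(f"instance role: {role}")

# Enforcement is a separate step: set HttpTokens to "required" via
# ec2.modify_instance_metadata_options() so IMDSv1 stops answering at all.
```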



Corey: You know the drill: You’re just barely falling asleep and you’re jolted awake by an emergency page. That’s right, it’s your night on call, and this is the bad kind of Call of Duty. The good news is, is that you’ve got New Relic, so you can quickly run down the incident checklist and find the problem. You have an errors inbox that tells you that Lambdas are good, RUM is good, but something’s up in APM. So, you click the error and find the deployment marker where it all began. Dig deeper, there’s another set of errors. What is it? Of course, it’s Kubernetes, starting after an update. You ask that team to roll back and bam, problem solved. That’s the value of combining 16 different monitoring products into a single platform: You can pinpoint issues down to the line of code quickly. That’s why the Dev and Ops teams at DoorDash, GitHub, Epic Games, and more than 14,000 other companies use New Relic. The next late-night call is just waiting to happen, so get New Relic before it starts. And you can get access to the whole New Relic platform at 100 gigabytes of data free, forever, with no credit card. Visit newrelic.com/morningbrief that’s newrelic.com/morningbrief.

Corey: AWS’s Dan Urson has a thread on how to report security issues in other people’s software. Something about it has been nagging at me, and I think I’ve figured out what it is. Ignore the stuff about, “Have a coherent report,” and, “Demonstrate a reproduction case;” it gets into following the vendor’s procedures and whatnot around disclosure. I think it has to do with where I’m coming from. I generally don’t find security problems, or other bugs, by actively exploiting vendor systems; instead, I trip over them as a customer trying to get something done. The idea that I owe that vendor much of anything when I’m in that position rankles a bit. I get that this is a nuanced topic.

And of course, 3TB of airport employee records were exposed in this week’s S3 Bucket Negligence Award. I hate to sound like I’m overly naive here, but what exactly is in the employee records that makes them take up that much space? I’m a big believer in not storing information you don’t need, and that just seems like an enormous pile of data to have lying around awaiting compromise.

AWS themselves had an interesting post go out: “Security Practices in AWS Multi-Tenant SaaS Environments”. It’s a decent rundown of the things to think about. It’s key to consider concepts like the ones they cover as early in the process as possible, just because otherwise you’re trying to bolt on security after the fact, and I’m sorry, but that just doesn’t work. If it isn’t built in from the beginning, you’re forever going to be playing defense.

And finally, an interesting tool: pet hookup app Date-a-Dog has a new open-source project called Stratus Red Team that emulates common attack techniques directly in your cloud environment. This feels like it’s much more aligned with deep-in-the-weeds offensive security teams, but it’s nice to know that there’s freely available tooling around this, should you need it. And that’s what happened last week in AWS security.

Corey: Thank you for listening to the AWS Morning Brief: Security Edition with the latest in AWS security that actually matters. Please follow AWS Morning Brief on Apple Podcast, Spotify, Overcast—or wherever the hell it is you find the dulcet tones of my voice—and be sure to sign up for the Last Week in AWS newsletter at lastweekinaws.com.

Announcer: This has been a HumblePod production. Stay humble.

Thu, 10 Feb 2022 03:00:00 -0800
GuardDuty for EKS and Why Security Should Be Free

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/guardduty-for-eks-and-why-security-should-be-free



Never miss an episode



Help the show



What's Corey up to?

Wed, 09 Feb 2022 03:00:00 -0800
AWS Comcast Service Appointment
AWS Morning Brief for the week of February 7, 2022 with Corey Quinn.
Mon, 07 Feb 2022 03:00:00 -0800
Privacy Means Your Data Is Private to You and Also Google

Links:


Transcript

Corey: This is the AWS Morning Brief: Security Edition. AWS is fond of saying security is job zero. That means it’s nobody in particular’s job, which means it falls to the rest of us. Just the news you need to know, none of the fluff.

Corey: This episode is sponsored in part by our friends at Sysdig. Sysdig is the solution for securing DevOps. They have a blog post that went up recently about how an insecure AWS Lambda function could be used as a pivot point to get access into your environment. They’ve also gone deep in-depth with a bunch of other approaches to how DevOps and security are inextricably linked. To learn more, visit sysdig.com and tell them I sent you. That’s S-Y-S-D-I-G dot com. My thanks to them for their continued support of this ridiculous nonsense.

After the content for this episode was effectively laid out, AWS did a late Friday night announcement of a new GuardDuty enhancement that would automatically opt people in to a chargeable service unless they explicitly opted each account out. This obviously doesn’t thrill me or other affected customers. So, as I record this, the situation is still evolving, but rest assured I’m going to have further thoughts on this next week.

Now, let’s see what happened last week in AWS security. So, last year, Wiz found three vulnerabilities that allowed attackers to read or write into other customers’ AWS accounts. This flew beneath the radar at the time, but they’re all coming out of the woodwork now, and AWS’s security reputation more or less lies in tatters, replaced by a reputation for clamming up and admitting nothing. I’m already wincing at this summer’s re:Inforce keynote. If they try their usual messaging line, it’s not going to end well for them.

There was apparently a serious vulnerability within the Linux polkit library. It took Amazon Linux an embarrassingly long time to acknowledge it and put out a release. Now, I’m not a fan of single-vendor Linux installs; any bets on how many non-Amazonians have commit rights to the distribution?

Failing to learn from experience is never a great look, but as per ProPublica, “Companies Leave Vast Amounts of Sensitive Data Unprotected” despite decades of breaches. Please, please, please, if you’re listening to this, don’t be one of them. There’s no value in buying the latest whiz-bang vendor software to defend against state-level actors if you’re going to leave the S3 bucket containing the backups open to the world.

And in an uncomfortable reminder that we might not be the only parties perusing our “private” files stored with various cloud providers, Google Drive started mistakenly flagging files as infringing copyright. Now, amusingly, the files in question tended to consist entirely of a single character, but the reminder isn’t usually something that cloud providers want dangled in front of us. Once again we are, in fact, reminded that Google considers privacy to be keeping information between you and Google.

Corey: You know the drill: you’re just barely falling asleep and you’re jolted awake by an emergency page. That’s right, it’s your night on call, and this is the bad kind of Call of Duty. The good news is, is that you’ve got New Relic, so you can quickly run down the incident checklist and find the problem. You have an errors inbox that tells you that Lambdas are good, RUM is good, but something’s up in APM. So, you click the error and find the deployment marker where it all began. Dig deeper, there’s another set of errors. What is it? Of course, it’s Kubernetes, starting after an update. You ask that team to roll back and bam, problem solved. That’s the value of combining 16 different monitoring products into a single platform: you can pinpoint issues down to the line of code quickly. That’s why the Dev and Ops teams at DoorDash, GitHub, Epic Games, and more than 14,000 other companies use New Relic. The next late-night call is just waiting to happen, so get New Relic before it starts. And you can get access to the whole New Relic platform at 100 gigabytes of data free, forever, with no credit card. Visit newrelic.com/morningbrief that’s newrelic.com/morningbrief.

AWS had a couple of interesting blog posts. One of them was “How to deploy AWS Network Firewall to help protect your network from malware”, and I’m torn on this service, to be honest. On the one hand, it extends the already annoying pricing model of the Managed NAT Gateway; on the other, it provides a lot more than simple address translation and is cost-competitive with a number of other solutions in this space. I think I’m going to land on, “Use it if it makes sense for you, but don’t expect it to be cheap.”



And a great blog post from AWS security folks—which is, honestly, something I have said a lot in the past, and I look forward to saying a lot more of in the future—“How to use tokenization to improve data security and reduce audit scope”. “Reducing the scope” is one of the best ways to make audits hurt less, but it tends to be infrequently discussed. This is worth paying attention to.
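
The core trick fits in a few lines: swap the sensitive value for a meaningless token and keep the mapping in one tightly guarded vault, so only the vault is in audit scope. A toy sketch; a real vault is a hardened service with access controls, not a dict:

```python
# Toy tokenization: downstream systems store and log only the token; just
# the vault can map it back to the real value.
import secrets

vault = {}  # stand-in for a locked-down token vault service

def tokenize(card_number):
    token = f"tok_{secrets.token_hex(12)}"
    vault[token] = card_number
    return token

def detokenize(token):
    return vault[token]

token = tokenize("4111 1111 1111 1111")
print(token)              # safe to store anywhere; useless to a thief
print(detokenize(token))  # only callers with vault access get this far
```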

And lastly, there was an interesting tool that came out. Well, not so much an interesting tool as an interesting blog post: a step-by-step walkthrough that, with some open-source software and a few configuration options, gets you to a place of “Ransomware-resistant backups with S3”. It leverages the Duplicity open-source tool but doesn’t handwave over how the integration works. More like this, please; the S3-side piece that makes this work is sketched below. And that’s what happened last week in AWS security. Thanks for listening, and I’ll talk to you more next week.
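
The backup engine in the post is Duplicity, so take this as the S3 side of the story only: Object Lock in compliance mode is what makes the copies genuinely unerasable, and it has to be enabled when the bucket is created. A minimal sketch with hypothetical names (region chosen arbitrarily):

```python
# Create a backup bucket with Object Lock, then set a default compliance
# retention so objects can't be deleted or overwritten, even by root,
# until the window lapses.
import boto3

s3 = boto3.client("s3", region_name="us-west-2")
BUCKET = "example-backup-vault"  # hypothetical bucket name

s3.create_bucket(
    Bucket=BUCKET,
    ObjectLockEnabledForBucket=True,  # cannot be turned on after creation
    CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
)
s3.put_object_lock_configuration(
    Bucket=BUCKET,
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 30}},
    },
)
```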

Corey: Thank you for listening to the AWS Morning Brief: Security Edition with the latest in AWS security that actually matters. Please follow AWS Morning Brief on Apple Podcast, Spotify, Overcast—or wherever the hell it is you find the dulcet tones of my voice—and be sure to sign up for the Last Week in AWS newsletter at lastweekinaws.com.



Announcer: This has been a HumblePod production. Stay humble.

Thu, 03 Feb 2022 03:00:00 -0800
Going Out to Play with the CDK

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/going-out-to-play-with-the-cdk



Never miss an episode



Help the show



What's Corey up to?

Wed, 02 Feb 2022 03:00:00 -0800
Amazon Basics MongoDB Offers Free Trial
AWS Morning Brief for the week of January 31, 2022 with Corey Quinn.
Mon, 31 Jan 2022 03:00:00 -0800
An SSH Key Request

Links:


Transcript

Corey: This is the AWS Morning Brief: Security Edition. AWS is fond of saying security is job zero. That means it’s nobody in particular’s job, which means it falls to the rest of us. Just the news you need to know, none of the fluff.

Corey: This episode is sponsored in part by our friends at Sysdig. Sysdig is the solution for securing DevOps. They have a blog post that went up recently about how an insecure AWS Lambda function could be used as a pivot point to get access into your environment. They’ve also gone deep in-depth with a bunch of other approaches to how DevOps and security are inextricably linked. To learn more, visit sysdig.com and tell them I sent you. That’s S-Y-S-D-I-G dot com. My thanks to them for their continued support of this ridiculous nonsense.

Corey: So, most interesting this week is probably my request for AWS to support a different breed of SSH key. No, it’s not a joke. Listen on and we’ll get there.

So, from the security community last week, everyone talks about how to secure AWS environments. This post takes a different direction and talks about how to secure GitHub organizations, which makes sense if you think about it as an area to focus on. If you compromise an org’s GitHub repositories, it’s basically game over for that company.

I also came across this post from 2020, talking about how if asked politely, CloudTrail would spew other accounts’ credentials your way. How many more exploits like this have we seen and just never been told about?

NCC Group has some great stories up about compromising CI/CD pipelines, and they are all spot on. Because nobody really thinks about the Jenkins box that has everyone working with it, outsized permissions, and of course, no oversight.

Enterprise cloud risk is a very real thing, so here’s a post from Josh Stella, who’s the CEO of Fwage—though he pronounces it as ‘Fugue.’ It makes some excellent points, and also cites me, so of course I’m going to mention it here. We incentivize the behaviors we want to see more of. There’s a security lesson in there somewhere.

Corey: This episode is sponsored in part by our friends at New Relic. If you’re like most environments, you probably have an incredibly complicated architecture, which means that monitoring it is going to take a dozen different tools. And then we get into the advanced stuff. We all have been there and know that pain, or will learn it shortly, and New Relic wants to change that. They’ve designed everything you need in one platform with pricing that’s simple and straightforward, and that means no more counting hosts. You also can get one user and a hundred gigabytes a month, totally free. To learn more, visit newrelic.com. Observability made simple.

Now, from AWS, what have they said? “Amazon EC2 customers can now use ED25519 keys for authentication with EC2 Instance Connect”. I really wish they’d add support for ECDSA keys as well, and no, this is not me making a joke. Those are the only key types Apple lets you store in the Secure Enclave on Macs that support it, and as a result, you can use that while never exporting the private key. I try very hard to avoid having private key material resident on disk, and that would make it one step easier.
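
Incidentally, EC2 Instance Connect pairs nicely with the keep-keys-off-disk goal even today: generate an ephemeral Ed25519 key in memory and push only the public half. A minimal sketch using the cryptography library; instance ID, OS user, and AZ are hypothetical placeholders:

```python
# Generate an in-memory Ed25519 key and push its public half via EC2
# Instance Connect; the pushed key is only honored for about a minute.
import boto3
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

key = Ed25519PrivateKey.generate()
public_openssh = key.public_key().public_bytes(
    serialization.Encoding.OpenSSH,
    serialization.PublicFormat.OpenSSH,
).decode()

boto3.client("ec2-instance-connect").send_ssh_public_key(
    InstanceId="i-0123456789abcdef0",  # hypothetical instance
    InstanceOSUser="ec2-user",
    SSHPublicKey=public_openssh,
    AvailabilityZone="us-east-1a",
)
```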

“Integrating AWS Security Hub, IBM Netcool, and ServiceNow, to Secure Large Client Deployments”. I keep talking about how if it’s not simple, it’s very hard to secure. AWS, IBM, and ServiceNow all integrating is about as far from “simple” as it’s possible to get.

“Best practices for cross-Region aggregation of security findings”. I was about to snark that this should be as simple as “click the button,” but then I read the post, and to my surprise and, yes, delight, it already is. Good work.

And in the land of tools, I found a post talking about how to assume AWS IAM roles using SAML.to in GitHub Actions, and I really wish that were first-party, but I’ll take what I can get. Because again, I despise the idea of permanent IAM credentials just hanging out in GitHub or on disk or, realistically, anywhere. I like these ephemeral approaches. You can be a lot more dynamic with them, and breaching those credentials doesn’t generally result in disaster for everyone. And that’s what happened last week in AWS security.



Corey: Thank you for listening to the AWS Morning Brief: Security Edition with the latest in AWS security that actually matters. Please follow AWS Morning Brief on Apple Podcast, Spotify, Overcast—or wherever the hell it is you find the dulcet tones of my voice—and be sure to sign up for the Last Week in AWS newsletter at lastweekinaws.com.

Announcer: This has been a HumblePod production. Stay humble.

Thu, 27 Jan 2022 03:00:00 -0800
ClickOps

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/clickops



Never miss an episode



Help the show



What's Corey up to?

Wed, 26 Jan 2022 03:00:00 -0800
AWS Boldly Responds With Silence
AWS Morning Brief for the week of January 24, 2022 with Corey Quinn.
Mon, 24 Jan 2022 03:00:00 -0800
The Gruntled Developer

Links:


Transcript

Corey: This is the AWS Morning Brief: Security Edition. AWS is fond of saying security is job zero. That means it’s nobody in particular’s job, which means it falls to the rest of us. Just the news you need to know, none of the fluff.

Corey: This episode is sponsored in part by my friends at Thinkst Canary. Most companies find out way too late that they’ve been breached. Thinkst Canary changes this and I love how they do it. Deploy canaries and canary tokens in minutes, and then forget about them. What’s great is then attackers tip their hand by touching them, giving you one alert, when it matters. I use it myself and I only remember this when I get the weekly update with a, “We’re still here, so you’re aware,” from them. It’s glorious. There is zero admin overhead to this, there are effectively no false positives unless I do something foolish. Canaries are deployed and loved on all seven continents. You can check out what people are saying atcanary.love. And, their Kube config canary token is new and completely free as well. You can do an awful lot without paying them a dime, which is one of the things I love about them. It is useful stuff and not a, “Oh, I wish I had money.” It is spectacular. Take a look. That'scanary.love because it’s genuinely rare to find a security product that people talk about in terms of love. It really is a neat thing to see.Canary.love. Thank you to Thinkst Canary for their support of my ridiculous, ridiculous nonsense.



Corey: So, yesterday’s episode put the boots to AWS, not so much for the issues that Orca Security uncovered, but rather for its poor communication around the topic. Now that that’s done, let’s look at the more mundane news from last week’s cloud world. Every day is a new page around here, full of opportunity and possibility in equal measure.

This week’s S3 Bucket Negligence Award goes to the Nigerian government for exposing millions of their citizens to a third party who most assuredly did not follow coordinated disclosure guidelines. Whoops.

There’s an interesting tweet, and the situation was still unfolding at the time of this writing, but it looks like making an API Gateway ‘private’ doesn’t mean “private to your VPCs,” but rather “open to anyone in a VPC: any VPC, anywhere.” This is evocative of the way that “Any Authenticated AWS User” for S3 buckets caused massive permissions issues industry-wide.
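
Until the semantics are nailed down, the defensive move is a resource policy that pins invocation to your own VPC endpoint rather than trusting ‘private’ to mean what you hoped. A sketch of that policy shape, with a hypothetical endpoint ID:

```python
# Allow invocation broadly, then deny anything that didn't arrive through
# the one VPC endpoint you actually own.
import json

resource_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": "*",
            "Action": "execute-api:Invoke",
            "Resource": "execute-api:/*",
        },
        {
            "Effect": "Deny",
            "Principal": "*",
            "Action": "execute-api:Invoke",
            "Resource": "execute-api:/*",
            "Condition": {
                "StringNotEquals": {
                    "aws:SourceVpce": "vpce-0123456789abcdef0"  # hypothetical
                }
            },
        },
    ],
}
print(json.dumps(resource_policy, indent=2))
```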



And a periodic and growing concern is one of software supply chain—which is a fancy way of saying, “We’re all built on giant dependency chains”—what happens when, say, a disgruntled developer corrupts their own NPM libs ‘colors’ and ‘faker’, breaking thousands of apps across the industry, including some of the AWS SDKs? How do we manage that risk? How do we keep developers gruntled?

Corey: Are you building cloud applications with a distributed team? Check out Teleport, an open-source identity-aware access proxy for cloud resources. Teleport provides secure access for anything running somewhere behind NAT: SSH servers, Kubernetes clusters, internal web apps, and databases. Teleport gives engineers superpowers.

Get access to everything via single sign-on with multi-factor, list and see all of SSH servers, Kubernetes clusters, or databases available to you in one place, and get instant access to them using tools you already have. Teleport ensures best security practices like role-based access, preventing data exfiltration, providing visibility, and ensuring compliance. And best of all, Teleport is open-source and a pleasure to use. Download Teleport at goteleport.com. That’s goteleport.com.

AWS had a couple of interesting things. The first is “Top ten security best practices for securing backups in AWS”. People really don’t consider the security implications of their backups anywhere near seriously enough. It’s not ‘live’ but it’s still got—by definition—a full set of your data just waiting to be harvested by nefarious types. Be careful with that.

And of course, AWS had two security bulletins, one about its Glue issues, one about its CloudFormation issues. The former allowed cross-account access to other tenants. In theory. In practice, AWS did the responsible thing and kept every access event logged, going back for the full five years of the service’s life. That’s remarkably impressive.



And lastly, I found an interesting tool called s3-credentials last week. What it does is help generate tightly-scoped IAM credentials: previously it could limit them to a single S3 bucket, and now it can limit them to a single prefix within that bucket. You can also make those credential sets incredibly short-lived. More things like this, please (the shape of the policy it generates is sketched below); I just tend to over-scope things way too much. And that’s what happened Last Week in AWS: Security. Please feel free to reach out and tell me exactly what my problem is.
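
The policy shape it generates looks roughly like this; note that listing is a bucket-level action constrained by an s3:prefix condition, while object actions constrain the key ARN itself. Bucket and prefix names are hypothetical:

```python
# A prefix-scoped S3 policy of the kind s3-credentials produces.
import json

BUCKET, PREFIX = "example-bucket", "team-a/"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": f"arn:aws:s3:::{BUCKET}",
            "Condition": {"StringLike": {"s3:prefix": f"{PREFIX}*"}},
        },
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": f"arn:aws:s3:::{BUCKET}/{PREFIX}*",
        },
    ],
}
print(json.dumps(policy, indent=2))
```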

Corey: Thank you for listening to the AWS Morning Brief: Security Edition with the latest in AWS security that actually matters. Please follow AWS Morning Brief on Apple Podcast, Spotify, Overcast—or wherever the hell it is you find the dulcet tones of my voice—and be sure to sign up for the Last Week in AWS newsletter at lastweekinaws.com.



Announcer: This has been a HumblePod production. Stay humble.

Thu, 20 Jan 2022 03:00:00 -0800
Orca Security, AWS, and the Killer Whale of a Problem

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/orca-security-aws-and-the-killer-whale-of-a-problem


Never miss an episode



Help the show



What's Corey up to?

Wed, 19 Jan 2022 03:00:00 -0800
New Consolation
AWS Morning Brief for the week of January 17, 2022 with Corey Quinn.
Mon, 17 Jan 2022 03:00:00 -0800
CISOs Should Ideally Stay Out of Prison

Links:


Transcript

Corey: This is the AWS Morning Brief: Security Edition. AWS is fond of saying security is job zero. That means it’s nobody in particular’s job, which means it falls to the rest of us. Just the news you need to know, none of the fluff.

This episode is sponsored in part by our friends at Rising Cloud, which I hadn’t heard of before, but they’re doing something vaguely interesting here. They are using AI, which is usually where my eyes glaze over and I lose attention, but they’re using it to help developers be more efficient by reducing repetitive tasks. So, the idea being that you can run stateless things without having to worry about scaling, placement, et cetera, and the rest. They claim significant cost savings, and they’re able to wind up taking what you’re running as it is in AWS with no changes, and run it inside of their data centers that span multiple regions. I’m somewhat skeptical, but their customers seem to really like them, so that’s one of those areas where I really have a hard time being too snarky about it because when you solve a customer’s problem and they get out there in public and say, “We’re solving a problem,” it’s very hard to snark about that. Multus Medical, Construx.ai and Stax have seen significant results by using them. And it’s worth exploring. So, if you’re looking for a smarter, faster, cheaper alternative to EC2, Lambda, or batch, consider checking them out. Visit risingcloud.com/benefits. That’s risingcloud.com/benefits, and be sure to tell them that I sent you because watching people wince when you mention my name is one of the guilty pleasures of listening to this podcast.

Welcome to Last Week in AWS: Security. Let’s dive in. Norton 360—which sounds like a prelude to an incredibly dorky attempt at the moonwalk—now comes with a cryptominer. You know, the thing you use tools like this to avoid having on your computer? This is apparently to offset how zippy modern computers have gotten, in a direct affront to Norton’s ability to make even maxed-out laptops run like total garbage. Speaking of total garbage, you almost certainly want to use literally any other vendor for this stuff now.

“What’s the worst that can happen?” is sometimes a comforting thought when dealing with professional challenges. If you’re the former Uber CISO, the answer to that question is apparently, “You could be federally charged with wire fraud for paying off a security researcher.”

And lastly, Azure continues to have security woes, this time in the form of a source code leak of its Azure App Service. It’s a bad six months and counting to be over in Microsoft-land when it comes to cloud.

Let’s take a look at what AWS has done. “Comprehensive Cyber Security Framework for Primary (Urban) Cooperative Banks (UCBs)”. This is a perfect case study in what’s wrong with the way we talk about security. First, clicking the link to the report in the blog post threw an error; I had to navigate to the AWS Artifact console and download the PDF manually. Then, the PDF is all of two pages long, as it apparently has an embedded Excel document within it that Preview on my Mac can’t detect. The proper next step is to download Adobe Acrobat for Mac in order to read this, but I’ve given up by this point. This may be the most remarkable case of AWS truly understanding its customer mentality that we’ve seen so far this year.

Are you building cloud applications with a distributed team? Check out Teleport, an open-source identity-aware access proxy for cloud resources. Teleport provides secure access for anything running somewhere behind NAT: SSH servers, Kubernetes clusters, internal web apps, and databases. Teleport gives engineers superpowers. Get access to everything via single sign-on with multi-factor, list and see all of SSH servers, Kubernetes clusters, or databases available to you in one place, and get instant access to them using tools you already have. Teleport ensures best security practices like role-based access, preventing data exfiltration, providing visibility, and ensuring compliance. And best of all, Teleport is open-source and a pleasure to use. Download Teleport at goteleport.com. That’s goteleport.com.



“Disabling Security Hub controls in a multi account environment”. I hate that this is a solution instead of a native feature, but it’s important. There are some Security Hub controls that are just nonsense. “Oh no, you didn’t encrypt your EBS volumes.” “Oh dear, you haven’t rotated your IAM credentials in 90 days.” “Holy CRAP, the S3 bucket serving static assets to the world is world-readable.” You get the picture.
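
Turning off one control in one account and region is a single API call; the pain the post solves is repeating that across an entire organization. A minimal sketch of the per-account call (the control ARN is illustrative of the format, not pulled from anywhere specific):

```python
# Disable one noisy Security Hub control, with a recorded reason.
import boto3

securityhub = boto3.client("securityhub")

securityhub.update_standards_control(
    StandardsControlArn=(
        "arn:aws:securityhub:us-east-1:111122223333:"
        "control/cis-aws-foundations-benchmark/v/1.2.0/1.4"
    ),
    ControlStatus="DISABLED",
    DisabledReason="Keys are rotated via SSO federation; control is noise",
)
```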

And a tool I found fun: “Port Knocking” is an old security technique in which you attempt to connect to a host on a predetermined sequence of ports. Get it right and you’re now able to connect to the host in question on the port that you want. ipv6-ghost-ship has done something similar yet even more ridiculous: it takes advantage of the fact that IPv6 gives each EC2 instance 281 trillion IP addresses, and only accepts SSH connections when the last three octets of the destination address match the time-based authentication code. This is a ridiculous hack, and I love it oh so very much; a sketch of the idea follows below. I’m Chief Cloud Economist at The Duckbill Group, and this has been Last Week in AWS: Security. Thanks for listening.
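
The client half of that trick is just TOTP arithmetic plus string formatting. A stdlib-only sketch: the prefix, the shared secret, and the exact digit placement are all illustrative stand-ins rather than ipv6-ghost-ship’s actual wire format:

```python
# Compute a standard RFC 6238 TOTP code, then embed its digits in the
# final hextets of the instance's IPv6 address.
import base64
import hashlib
import hmac
import struct
import time

def totp(secret_b32, step=30, digits=6):
    key = base64.b32decode(secret_b32)
    counter = struct.pack(">Q", int(time.time()) // step)
    digest = hmac.new(key, counter, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return code % (10 ** digits)

code = f"{totp('JBSWY3DPEHPK3PXP'):06d}"     # hypothetical shared secret
prefix = "2600:1f14:aaaa:bbbb:cccc:0"        # hypothetical /80 on the instance
address = f"{prefix}:{code[:3]}:{code[3:]}"  # e.g. ...cccc:0:123:456
print(f"ssh ec2-user@{address}")
```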

Corey: Thank you for listening to the AWS Morning Brief: Security Edition with the latest in AWS security that actually matters. Please follow AWS Morning Brief on Apple Podcast, Spotify, Overcast—or wherever the hell it is you find the dulcet tones of my voice—and be sure to sign up for the Last Week in AWS newsletter at lastweekinaws.com.

Announcer: This has been a HumblePod production. Stay humble.

Thu, 13 Jan 2022 03:00:00 -0800
Azure's Terrible Security Posture Comes Home to Roost

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/azures-terrible-security-posture-comes-home-to-roost/



Never miss an episode



Help the show



What's Corey up to?

Wed, 12 Jan 2022 03:00:00 -0800
LakeTrail for Clouds
AWS Morning Brief for the week of January 10, 2022 with Corey Quinn.
Mon, 10 Jan 2022 03:00:00 -0800
Time to Give LastPass the Heave

Links:


Transcript

Corey: This is the AWS Morning Brief: Security Edition. AWS is fond of saying security is job zero. That means it’s nobody in particular’s job, which means it falls to the rest of us. Just the news you need to know, none of the fluff.

Corey: This episode is sponsored in part by LaunchDarkly. Take a look at what it takes to get your code into production. I’m going to just guess that it’s awful because it’s always awful. No one loves their deployment process. What if launching new features didn’t require you to do a full-on code and possibly infrastructure deploy? What if you could test on a small subset of users and then roll it back immediately if results aren’t what you expect? LaunchDarkly does exactly this. To learn more, visit launchdarkly.com and tell them Corey sent you, and watch for the wince.

Corey: The first security round-up of the year in Last Week in AWS: Security. This is relatively light, just because it covers the last week of the year, where people didn’t really “Work” so much as “Get into fights on Twitter.” Onward.

So, from the community, ever see a data breach announcement that raises oh so very many more questions than it answers? I swear this headline is from a week or so ago, not 1998: “Tokyo police lose 2 floppy disks containing personal info on 38 public housing applicants”. Yes, I said floppy disks.

The terrible orange website, also known as Hacker News, reports that LastPass may have suffered a breach. At the time I write this, the official LastPass blog has a, “No, it’s just people reusing passwords.” Enough people I trust have seen this behavior that I’d be astounded if that were true. If you can’t trust your password manager, ditch them immediately.

Security Boulevard had a roundup of the “Worst AWS Data Breaches of 2021”, and it’s the usual run-of-the-mill S3 bucket problems, but my personal favorite’s the Twitch breach because it’s particularly embarrassing, given that it is, in fact, an Amazon subsidiary.

The first of this week’s two S3 Bucket Negligence Awards goes to D.W. Morgan for leaking 100GB of client data. And they’re a logistics company that serves giant enterprises, which are companies with zero sense of humor, so I would not want to be in D.W. Morgan’s position this week.

And the other is a little funnier. It goes to SEGA Europe, after Sonic the Hedgehog forgets to perform due diligence on his AWS environment.

Corey: Are you building cloud applications with a distributed team? Check out Teleport, an open-source identity-aware access proxy for cloud resources. Teleport provides secure access for anything running somewhere behind NAT: SSH servers, Kubernetes clusters, internal web apps, and databases. Teleport gives engineers superpowers. Get access to everything via single sign-on with multi-factor, list and see all of SSH servers, Kubernetes clusters, or databases available to you in one place, and get instant access to them using tools you already have. Teleport ensures best security practices like role-based access, preventing data exfiltration, providing visibility, and ensuring compliance. And best of all, Teleport is open-source and a pleasure to use. Download Teleport at goteleport.com. That’s goteleport.com.



AWS had only a single thing that I found interesting: “Identity Guide–Preventive controls with AWS Identity–SCPs”. I’ve been waiting a while for a good explainer on SCPs to come out, and this looks like it actually is a thing that I want. I’ve been playing around with SCPs a lot more for the past couple of weeks. If you’re unfamiliar, they’re a way to restrict what anyone, up to and including the root user, can do in an organization’s member accounts. It’s super handy for constraining people from doing things that are otherwise foolhardy.
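
As a concrete taste of the ‘constrain the foolhardy’ part, here’s a minimal sketch of creating and attaching an SCP that keeps member accounts from switching off CloudTrail. The policy name and target root ID are hypothetical, and this has to run from the management account:

```python
# Create an SCP denying CloudTrail tampering and attach it at the root,
# so it applies to every member account beneath it.
import json
import boto3

org = boto3.client("organizations")

scp = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyCloudTrailTampering",
        "Effect": "Deny",
        "Action": ["cloudtrail:StopLogging", "cloudtrail:DeleteTrail"],
        "Resource": "*",
    }],
}

policy = org.create_policy(
    Name="deny-cloudtrail-tampering",  # hypothetical policy name
    Description="Keep CloudTrail running in member accounts",
    Type="SERVICE_CONTROL_POLICY",
    Content=json.dumps(scp),
)
org.attach_policy(
    PolicyId=policy["Policy"]["PolicySummary"]["Id"],
    TargetId="r-ab12",  # hypothetical organization root ID
)
```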

And lastly, an interesting tool came out from Google—which I should not have to explain to you folks; they turn things off, like Reader—in the form of a log4j scanner. This one scans files on disk to detect the bad versions of log4j—which is most of them—and can replace them with the good version—which is, of course, print statements. And that’s what happened last week in AWS security. Hopefully next week will be… well, I don’t want to say less contentful, but I do want to say it’s at least not as exciting as the last month has been. Thanks for listening.

Corey: Thank you for listening to the AWS Morning Brief: Security Edition with the latest in AWS security that actually matters. Please follow AWS Morning Brief on Apple Podcast, Spotify, Overcast—or wherever the hell it is you find the dulcet tones of my voice—and be sure to sign up for the Last Week in AWS newsletter at lastweekinaws.com.

Announcer: This has been a HumblePod production. Stay humble.

Thu, 06 Jan 2022 03:00:00 -0800
The AWS Service I Hate the Most

Want to give your ears a break and read this as an article? You’re looking for this link. https://www.lastweekinaws.com/blog/the-aws-service-i-hate-the-most



Never miss an episode



Help the show



What's Corey up to?

Wed, 05 Jan 2022 03:00:00 -0800
AWS Burninate
AWS Morning Brief for the week of January 3, 2022 with Corey Quinn.
Mon, 03 Jan 2022 03:00:00 -0800
Self-Disclosure Heals Many Wounds

Links:


Transcript

Corey: This is the AWS Morning Brief: Security Edition. AWS is fond of saying security is job zero. That means it’s nobody in particular’s job, which means it falls to the rest of us. Just the news you need to know, none of the fluff.

Corey: Are you building cloud applications with a distributed team? Check out Teleport, an open-source identity-aware access proxy for cloud resources. Teleport provides secure access for anything running somewhere behind NAT: SSH servers, Kubernetes clusters, internal web apps, and databases. Teleport gives engineers superpowers. Get access to everything via single sign-on with multi-factor, list and see all of SSH servers, Kubernetes clusters, or databases available to you in one place, and get instant access to them using tools you already have. Teleport ensures best security practices like role-based access, preventing data exfiltration, providing visibility, and ensuring compliance. And best of all, Teleport is open-source and a pleasure to use. Download Teleport at goteleport.com. That’s goteleport.com.

Corey: Well, we’re certainly ending 2021 with a whirlwind in the security space. Log4J continues to haunt us, while AWS took not only an outage but also a bit of a security blunder that they managed to turn into a messaging win. Listen on.

But first, the Community. A depressing review of 2021’s “Cloud Security Breaches and Vulnerabilities.” Honestly, it seems like there are just so damned many ways for bad security to set the things we care about on fire. The takeaways are actionable though. Stop using static long-lived credentials and start with the basics before you get fancy.

Sennheiser scores itself an S3 Bucket Negligence Award, and of all the countries in which to suffer a data breach, I’ve got to say that Germany is at the bottom of the list. They do not mess around with data protection there.



And, holy hell: AWS inadvertently granted the role its support teams use to access customer accounts read access to S3 objects. It lasted for ten hours, and while there are mitigations out there, this is far from the first time that AWS has biffed it with regard to an unreviewed change making it into a managed IAM policy. This needs to be addressed. If you’ve got specific questions about how those things are handled, reach out to your account team; but it’s a terrible look. But there’s more to come in a second here.

Corey: This episode is sponsored in part by my friends at Cloud Academy. Something special for you folks: If you missed their offer on Black Friday or Cyber Monday or whatever sales day of the week it was, good news, they’ve opened up their Black Friday promotion for a very limited time. Same deal: $100 off a yearly plan, 249 bucks a year for the highest quality cloud and tech skills content. Nobody else is going to get this, and you have to act now because they have assured me this is not going to last for much longer. Go to cloudacademy.com, hit the ‘Start Free Trial’ button on the homepage and use the promo code, ‘CLOUD’ when checking out. That’s C-L-O-U-D. Like loud—what I am—with a C in front of it. They’ve got a free trial, too, so you’ll get seven days to try it out to make sure it really is a good fit. You’ve got nothing to lose except your ignorance about cloud. My thanks to Cloud Academy once again for sponsoring my ridiculous nonsense.

A bit off the beaten path, this week’s S3 Bucket Negligence Award goes to the government of Ghana. This one is pretty bad. I mean, you can’t exactly opt out of doing business with your government, you know?

Now, AWS has two things I want to talk about. The first is that they offer a way to “Simplify setup of Amazon Detective with AWS Organizations.” I’m actually enthusiastic about this one because there’s a significant lack of security tooling available to folks at the lower end of the market. A bunch of companies seem to start off targeting this segment, but soon realize that there’s a better future in selling things to bigger companies for $200,000 a month instead of $20.

Now, “AWSSupportServiceRolePolicy Informational Update.” You heard a minute ago that I was initially extremely unhappy about this mistake. That said, I am such a fan of this notification that I can’t even articulate it without sounding like I’m fanboying. Because mistakes happen, and talking about those mistakes and why defense in depth mitigates the harm of those mistakes goes a long way. This affirms my trust in AWS rather than harming it. Meanwhile, Azure has absolutely nothing to say about why their tenant separation is aspirational at best.



And lastly, a bit of a tooling story here. To end out the year, I’ve been kicking the tires on aws-sso-cli over on GitHub, which is a tool for using AWS SSO for both the CLI and web console. I don’t know why the native SSO tooling is quite as trash as it is, but it’s a problem. There’s a lot of value to using SSO, but AWS hides it as if the entire thing were under NDA. Thank you for listening. It’s been a heck of a year as we’ve launched the security portion of this weekly nonsense. I’ll talk to you more in 2022. Stay safe.



Corey: Thank you for listening to the AWS Morning Brief: Security Edition with the latest in AWS security that actually matters. Please follow AWS Morning Brief on Apple Podcast, Spotify, Overcast—or wherever the hell it is you find the dulcet tones of my voice—and be sure to sign up for the Last Week in AWS newsletter at lastweekinaws.com.

Announcer: This has been a HumblePod production. Stay humble.

Thu, 30 Dec 2021 03:00:00 -0800
Last Year in AWS

Want to give your ears a break and read this as an article? You’re looking for this link. https://www.lastweekinaws.com/blog/last-year-in-aws



Never miss an episode



Help the show



What's Corey up to?

Wed, 29 Dec 2021 03:00:00 -0800
Managed Grifting Service Now in Preview
AWS Morning Brief for the week of December 27, 2021 with Corey Quinn.
Mon, 27 Dec 2021 03:00:00 -0800
Yule4j

Links:


Transcript

Corey: This is the AWS Morning Brief: Security Edition. AWS is fond of saying security is job zero. That means it’s nobody in particular’s job, which means it falls to the rest of us. Just the news you need to know, none of the fluff.

Announcer: Are you building cloud applications with a distributed team? Check out Teleport, an open-source identity-aware access proxy for cloud resources. Teleport provides secure access for anything running somewhere behind NAT: SSH servers, Kubernetes clusters, internal web apps, and databases. Teleport gives engineers superpowers. Get access to everything via single sign-on with multi-factor, list and see all of SSH servers, Kubernetes clusters, or databases available to you in one place, and get instant access to them using tools you already have. Teleport ensures best security practices like role-based access, preventing data exfiltration, providing visibility, and ensuring compliance. And best of all, Teleport is open-source and a pleasure to use. Download Teleport at goteleport.com. That’s goteleport.com.

Corey: The burning yule log that is the log4j exploit and its downstream issues continues to burn fiercely. Meanwhile the year winds down, and it’s certainly been an eventful one. I’ll talk to you next week because that is what I do.

Now, let’s see from the community what happened. The patch to fix the log4j vulnerability apparently has its own vulnerability that’s actively under exploit. Find your nearest InfoSec friend and buy them a beer or forty because this is going to suck for a long time and basically ruin everyone’s holiday.


Also, I’ve seen the most hair-raising thing I can remember in InfoSec-land, which is the Google Project Zero deep dive into the NSO group’s iMessage exploit. Seriously, this thing requires no clicks on the part of the victim; the exploit uses a bug in the GIF processing inherent to iMessage to build a virtual CPU and assembly instruction set. There is no realistic defense against this short of hurling your phone into the sea, which I heartily recommend at this point as a best practice.

Oh, and everything is on fire and somehow worse. There are now at least three flaws in the log4j library that we’re counting, so far. Everything is terrible and we clearly should never log anything again.



Corey: This episode is sponsored in part by my friends at Cloud Academy. Something special for you folks: If you missed their offer on Black Friday or Cyber Monday or whatever day of the week doing sales it is, good news, they’ve opened up their Black Friday promotion for a very limited time. Same deal: $100 off a yearly plan, 249 bucks a year for the highest quality cloud and tech skills content. Nobody else is going to get this, and you have to act now because they have assured me this is not going to last for much longer. Go to cloudacademy.com, hit the ‘Start Free Trial’ button on the homepage and use the promo code, ‘CLOUD’ when checking out. That’s C-L-O-U-D. Like loud—what I am—with a C in front of it. They’ve got a free trial, too, so you’ll get seven days to try it out to make sure it really is a good fit. You’ve got nothing to lose except your ignorance about cloud. My thanks to Cloud Academy once again for sponsoring my ridiculous nonsense.

Now, AWS had a few things to say. The most relevant of them is How to customize behavior of AWS Managed Rules for WAF. So, if you’re a WAF vendor and you don’t link to this blog post as part of your, “Why should I pay you?” sales material, you’re missing a golden opportunity. Every time I dig into AWS’s Web Application Firewall offering, I end up regretting it, and with a headache.

There was also a post on Using AWS security services to protect against, detect, and respond to the Log4j vulnerability. I’m disappointed to see AWS starting to use the log4nonsense stuff to pitch a dizzying array of expensive security services that require customers to do an awful lot of independent work to get stuff configured properly. This kind of isn’t the time for that.

And they have an update page that they continue to update called Update for Apache Log4j2 Issue, and this post has more frequent updates than AWS’s “What’s new” RSS feed. It really drives home the sheer scope of the issue, how pervasive it is, and just how much empathy we should have for the AWS security team. Their job has pretty clearly been not fun for the last couple of weeks.

And lastly, the tip of the week is more of a request for help, honestly. I asked what I thought was an innocent question on Twitter: “What are people using to read and consume CloudTrail logs?” The answers made it clear that the answer was basically, “A bunch of very expensive enterprise-grade things,” or, “Nothing.” This feels like a missed opportunity for some enterprising company out there. If you’ve got a better answer here, please whack reply and let me know. You know where to find me. Thanks for listening. That’s what happened last week in AWS security. Enjoy the time off if you’re lucky enough to get any, and I’ll talk to you next week.

Corey: Thank you for listening to the AWS Morning Brief: Security Edition with the latest in AWS security that actually matters. Please follow AWS Morning Brief on Apple Podcast, Spotify, Overcast—or wherever the hell it is you find the dulcet tones of my voice—and be sure to sign up for the Last Week in AWS newsletter at lastweekinaws.com.

Announcer: This has been a HumblePod production. Stay humble.

Thu, 23 Dec 2021 03:00:00 -0800
Overstating AWS's Free Tier Generosity

Want to give your ears a break and read this as an article? You’re looking for this link. https://www.lastweekinaws.com/blog/overstating-awss-free-tier-generosity



Never miss an episode



Help the show



What's Corey up to?

Wed, 22 Dec 2021 03:00:00 -0800
Amazon Lookout for Twitter
AWS Morning Brief for the week of December 20, 2021 with Corey Quinn.
Mon, 20 Dec 2021 04:48:59 -0800
...And Now Everything Is On Fire

Links:


Transcript

Corey: This is the AWS Morning Brief: Security Edition. AWS is fond of saying security is job zero. That means it’s nobody in particular’s job, which means it falls to the rest of us. Just the news you need to know, none of the fluff.

Corey: It seems like there is a new security breach every day. Are you confident that an old SSH key or a shared admin account isn’t going to come back and bite you? If not, check out Teleport. Teleport is the easiest, most secure way to access all of your infrastructure. The open-source Teleport Access Plane consolidates everything you need for secure access to your Linux and Windows servers—and I assure you there is no third option there—Kubernetes clusters, databases, and internal applications like AWS Management Console, Jenkins, GitLab, Grafana, Jupyter Notebooks, and more. Teleport’s unique approach is not only more secure, it also improves developer productivity. To learn more, visit goteleport.com. And no, that’s not me telling you to go away; it is, goteleport.com.

Corey: I think I owe the entire internet a massive apology. See, last week I titled the episode, “A Somehow Quiet Security Week.” This is the equivalent of climbing to the top of a mountain peak during a violent thunderstorm, then waving around a long metal rod. While cursing God.

So, long story short, the internet is now on fire due to a vulnerability in the log4j open-source logging library. Effectively, if you can get an arbitrary string into the logs of a system that uses a vulnerable version of the log4j library, you can make that system issue outbound network requests and potentially run arbitrary code.

The impact is massive and this one’s going to be with us for years. WAF is a partial solution, but the only real answer is to patch to an updated version, or change a bunch of config options, or disallow affected systems from making outbound connections. Further, due to how thoroughly embedded in basically everything it is—like S3; more on that in a bit—a whole raft of software you run may very well be using this without your knowledge. This is, to be clear, freaking wild. I am deeply sorry for taunting fate last week. The rest of this issue of course talks entirely about this one enormous concern.
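If you want to check your own logs for probes, the signature is at least greppable. Here is a minimal sketch of mine, in Python rather than anything official, that flags the telltale ${jndi: lookup string; real attackers nest ${lower:}-style obfuscation, so treat this as a starting point, not a detector.

```python
import re
import sys

# The exploit rides on log4j expanding ${jndi:...} lookups. Attackers wrap the
# payload in ${lower:...}/${upper:...} tricks, so match loosely, not literally.
JNDI_PATTERN = re.compile(r"\$\{.{0,30}j.{0,30}n.{0,30}d.{0,30}i", re.IGNORECASE)

def scan(path):
    """Print any log line that smells like a Log4Shell probe."""
    with open(path, errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if JNDI_PATTERN.search(line):
                print(f"{path}:{number}: possible Log4Shell probe: {line.strip()}")

if __name__ == "__main__":
    for logfile in sys.argv[1:]:
        scan(logfile)
```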

Corey: This episode is sponsored in part by my friends at Cloud Academy. Something special for you folks: if you missed their offer on Black Friday or Cyber Monday or whatever day of the week doing sales it is, good news, they’ve opened up their Black Friday promotion for a very limited time. Same deal: $100 off a yearly plan, 249 bucks a year for the highest quality cloud and tech skills content. Nobody else is going to get this, and you have to act now because they have assured me this is not going to last for much longer. Go to cloudacademy.com, hit the ‘Start Free Trial’ button on the homepage and use the promo code, ‘CLOUD’ when checking out. That’s C-L-O-U-D. Like loud—what I am—with a C in front of it. They’ve got a free trial, too, so you’ll get seven days to try it out to make sure it really is a good fit. You’ve got nothing to lose except your ignorance about cloud. My thanks to Cloud Academy once again for sponsoring my ridiculous nonsense.

Cloudflare has a blog post talking about the timeline of what they see as a global observer of exploitation attempts of this nonsense. They’re automatically shooting it down for all of their customers and users—to be clear, if you’re not paying for a service you are not its customer, you’re a marketing expense—and they’re doing this as part of the standard service they provide. Meanwhile AWS’s WAF has added the ruleset to its AWSManagedRulesKnownBadInputsRuleSet—all one word—managed rules—wait a minute; they named it that? Oh, AWS. You sad, ridiculous service-naming cloud. But yeah, you have to enable AWS WAF, for which there is effectively no free tier, and configure this rule to get its protection, as I read AWS’s original update. I’m sometimes asked why I use CloudFlare as my CDN instead of AWS’s offerings. Well, now you know.
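For what it's worth, if you do wire this up yourself, attaching that comically named rule group to a web ACL with boto3 looks roughly like the sketch below. The ACL and metric names are placeholders of mine, not anything AWS prescribes.

```python
import boto3

# CLOUDFRONT-scoped web ACLs must be created in us-east-1.
wafv2 = boto3.client("wafv2", region_name="us-east-1")

wafv2.create_web_acl(
    Name="example-acl",  # placeholder name
    Scope="CLOUDFRONT",
    DefaultAction={"Allow": {}},
    Rules=[
        {
            "Name": "known-bad-inputs",
            "Priority": 0,
            # Managed rule groups take OverrideAction rather than Action.
            "OverrideAction": {"None": {}},
            "Statement": {
                "ManagedRuleGroupStatement": {
                    "VendorName": "AWS",
                    "Name": "AWSManagedRulesKnownBadInputsRuleSet",
                }
            },
            "VisibilityConfig": {
                "SampledRequestsEnabled": True,
                "CloudWatchMetricsEnabled": True,
                "MetricName": "known-bad-inputs",
            },
        }
    ],
    VisibilityConfig={
        "SampledRequestsEnabled": True,
        "CloudWatchMetricsEnabled": True,
        "MetricName": "example-acl",
    },
)
```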

Also, Kronos, an HR services firm, won the ransomware timing lottery. They’re expecting to be down for weeks, but due to the log4shell—which is what they’re calling this exploit: The log4shell problem—absolutely nobody is paying attention to companies that are having ransomware problems or data breaches. Good job, Kronos.

Now, what did AWS have to say? Well, they have an ongoing “Update for the Apache Log4j2 Issue” and they’ve been updating it as they go. But at the time of this recording, AWS is a Java shop, to my understanding.

That means that basically everything internet-facing at AWS—which is, you know, more or less everything they sell—has some risk exposure to this vulnerability. And AWS has moved with a speed that can only be described as astonishing, and mitigated this on their managed services in a timeline I wouldn’t have previously believed possible given the scope and scale here. This is the best possible argument to make for using higher-level managed services instead of building your own things on top of EC2. I just hope they’re classy enough not to use that as a marketing talking point.

And for the tool of the week, the Log4Shell Vulnerability Tester at log4shell.huntress.com automatically generates a string and then lets you know, when that string gets exploited by this vulnerability, what systems are connecting back as a result. Don’t misuse it obviously, but it’s great for validating whether a certain code path in your environment is vulnerable. And that’s what happened last week in AWS Security, and I just want to say again how deeply, deeply sorry I am for taunting fate and making everyone’s year suck. I’ll talk to you next week, if I live.

Corey: Thank you for listening to the AWS Morning Brief: Security Edition with the latest in AWS security that actually matters. Please follow AWS Morning Brief on Apple Podcast, Spotify, Overcast—or wherever the hell it is you find the dulcet tones of my voice—and be sure to sign up for the Last Week in AWS newsletter at lastweekinaws.com.

Announcer: This has been a HumblePod production. Stay humble.

Thu, 16 Dec 2021 03:00:00 -0800
Lessons in Trust from us-east-1

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/lessons-in-trust-from-us-east-1



Never miss an episode



Help the show



What's Corey up to?

Wed, 15 Dec 2021 03:00:00 -0800
us-east-1 of Eden
AWS Morning Brief for the week of December 13, 2021 with Corey Quinn.
Mon, 13 Dec 2021 03:00:00 -0800
A Somehow Quiet Security Week

Links:


Transcript

Corey: This is the AWS Morning Brief: Security Edition. AWS is fond of saying security is job zero. That means it’s nobody in particular’s job, which means it falls to the rest of us. Just the news you need to know, none of the fluff.

Corey: Are you building cloud applications with a distributed team? Check out Teleport, an open-source identity-aware access proxy for cloud resources. Teleport provides secure access for anything running somewhere behind NAT: SSH servers, Kubernetes clusters, internal web apps, and databases. Teleport gives engineers superpowers. Get access to everything via single sign-on with multi-factor. List and see all of SSH servers, Kubernetes clusters, or databases available to you in one place, and get instant access to them using tools you already have. Teleport ensures best security practices like role-based access, preventing data exfiltration, providing visibility, and ensuring compliance. And best of all, Teleport is open-source and a pleasure to use. Download Teleport at goteleport.com. That’s goteleport.com.

Corey: re:Invent has come and gone, and with it remarkably few security announcements. Shockingly, it was a slow week for the industry. I’m glad but also disappointed to be proven wrong in my, “The only thing you, as a company who isn’t AWS, should be announcing during re:Invent is your data breach since nobody will be paying attention,” snark. But it’s for the best. It means that maybe—maybe—we’re starting to see things normalize a bit.



Now, from the Community, we saw some interesting stuff. Scuttlebutt has it that cyber-security insurance providers are increasing their requirements to be insurable. This makes a lot of sense; as ransomware attacks become more numerous, nobody is going to want to cut large insurance checks to folks who didn’t think to have offline backups. You might want to check the specific terms and conditions of your policy.



I also liked a writeup as to “Why the C-suite doesn’t need access to all corporate data.” It’s true, but it’s super hard to defend against. When the CTO ‘requests’ access to the AWS root account, who’s likely to say no? If you’re going to push for proper separation of duties, either do it the right way or don’t even bother.


Corey: This episode is sponsored in part by my friends at Cloud Academy. Something special for you folks: if you missed their offer on Black Friday or Cyber Monday or whatever day of the week doing sales it is, good news, they’ve opened up their Black Friday promotion for a very limited time. Same deal: $100 off a yearly plan, 249 bucks a year for the highest quality cloud and tech skills content. Nobody else is going to get this, and you have to act now because they have assured me this is not going to last for much longer. Go to cloudacademy.com, hit the ‘Start Free Trial’ button on the homepage and use the promo code, ‘CLOUD’ when checking out. That’s C-L-O-U-D. Like loud—what I am—with a C in front of it. They’ve got a free trial, too, so you’ll get seven days to try it out to make sure it really is a good fit. You’ve got nothing to lose except your ignorance about cloud. My thanks to Cloud Academy once again for sponsoring my ridiculous nonsense.



Corey: And from AWS, there was really one glaring announcement that made me happy in the security context, and that was that “Amazon S3 Object Ownership can now disable access control lists to simplify access management for data in S3,” and it’s huge. S3 ACLs have been a pain in everyone’s side for years. Remember that S3 was the first AWS service to reach general availability, and the second into beta, after SQS. Meanwhile, IAM wasn’t released until 2010. “Ignore bucket ACLs so you don’t have to think about them” is a huge step towards normalizing security within AWS, specifically S3.
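If you'd rather flip that switch programmatically than click through the console, a minimal boto3 sketch follows; the bucket name is a placeholder, and note that BucketOwnerEnforced will break anything still relying on ACL grants.

```python
import boto3

s3 = boto3.client("s3")

# BucketOwnerEnforced disables ACLs entirely: the bucket owner owns every
# object, and access is governed by bucket policies and IAM alone.
s3.put_bucket_ownership_controls(
    Bucket="example-bucket",  # placeholder
    OwnershipControls={
        "Rules": [{"ObjectOwnership": "BucketOwnerEnforced"}]
    },
)
```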

And from the community's tools—I guess it’s not a tool so much as it is a tip or I don’t even know how you would describe it but I love it because Scott Piper is doing the lord’s work by curating a list of cloud provider security mistakes. Lord knows that none of them are going to be showcasing their own failures, or—thankfully—those of their competition because I don’t want to get in the middle of that mudslinging prize. This is well worth checking out and taking a look at, particularly when one provider or another starts getting a little too full of themselves around what they’re doing in security. That’s what happened last week in AWS security. Thank you for listening.

Corey: Thank you for listening to the AWS Morning Brief: Security Edition with the latest in AWS security that actually matters. Please follow AWS Morning Brief on Apple Podcast, Spotify, Overcast—or wherever the hell it is you find the dulcet tones of my voice—and be sure to sign up for the Last Week in AWS newsletter at lastweekinaws.com.

Announcer: This has been a HumblePod production. Stay humble.

Thu, 09 Dec 2021 03:00:00 -0800
How AWS Measures Customer Numbers

Want to give your ears a break and read this as an article? You’re looking for this link.
https://www.lastweekinaws.com/blog/how-aws-measures-its-customers



Never miss an episode



Help the show



What's Corey up to?

Wed, 08 Dec 2021 03:00:00 -0800
Releases of re:Invent

Releases of re:Invent Lyrics

AWS Backup speaks S3
Systems Manager: RDP
Improvements have hit Control Tower
Systems Manager speaks Greengrass
Evidently's name sucks ass
(It does A/B testing by the hour)

Streams in Kinesis
EMR and Jesus
MSK are now Serverless
Redshift is too
And this one should please you
FSx supports OpenZFS

Make development faster
Without a disaster
Too dangerous to go alone
You might give them a slappin'
For making this happen
But please go check out HoneyComb

Data Transfer new Free Tier
Slightly more free as in beer
So your bill is a bit less absurd
Don't use CloudWatch RUM
AWS is your chum
In the bloody sense of the word

They can't remain nameless
Thank You to Blameless
For helping out with SRE
It goes beyond on-call
And most importantly of all
Fingers aren’t pointing at me

DMS Fleet Advisor
The Sages get wiser
(SageMaker got features but I just don't care)
Now let’s show more respect
To our friend FSx’s
OpenZFS support, if you’re unaware

It impressed me a boatload
Amplify Studio's Low Code
But Amazon's scared of that phrase
Digital TwinMaker
Stuff for data lakers
OpenZFS deserves so much praise

RoboRunner runs robots
Archive for EBS snapshots
In case all your instances crash
If your users all sin
EBS Snapshot Recycle Bin
But they likely belong in the trash

“Cloud WAN” “Evidently”
“Private 5G” “Snow Family”
And SageMaker Ground Truth Plus
But I won't be shaming
Since the one person naming
Things well just got hit by a bus

Thanks go to Netlify
More deadly than Jai Alai
To AWS's clear JAMstack flex
Sure you could use S3
ACM CloudFront and Route53
That's just Netlify with extra steps


CDK V2 sounds like a bust
SDKs for Swift Kotlin and Rust
Construct Hub has launched into GA
Network Analyzer for VPC
Disable ACLs in S3
Storage admins will have a field day


Block regions within Control Tower
Compute optimizer bills you per picohour
Now the Snow Family speaks tape
Workspaces Web does you favors
EC2 has many more flavors
But I still go for Cherry and Grape

You knew this was coming
Because for four years running
It's sponsored by ChaosSearch
It speaks just like Elastic
Now does SQL more drastic
If you want to spend more
Then get out of my church

Stuff for the telecom sector
There's a new Inspector
That's sneakily powered by Snyk
Resilience Hub to fight failure
The Karpenter auto-scaler's
Either written in Go or in Greek

So Amazon is transitioning
Thank you for listening
To all of the nonsense I say
Now I’m going home
Where I can be alone
And I’ll probably be sleeping ‘till May.




Mon, 06 Dec 2021 03:00:00 -0800
re:Quinnvent Day 5
AWS Morning Brief for Day 5 of re:Quinnvent on Friday, December 5 with Corey Quinn.
Fri, 03 Dec 2021 10:20:44 -0800
re:Quinnvent Day 4
AWS Morning Brief for Day 4 of re:Quinnvent on Thursday, December 2 with Corey Quinn.
Thu, 02 Dec 2021 08:09:05 -0800
re:Invent Week

Links:


Transcript

Corey: This is the AWS Morning Brief: Security Edition. AWS is fond of saying security is job zero. That means it’s nobody in particular’s job, which means it falls to the rest of us. Just the news you need to know, none of the fluff.

Corey: This episode is sponsored in part by LaunchDarkly. Take a look at what it takes to get your code into production. I’m going to just guess that it’s awful because it’s always awful. No one loves their deployment process. What if launching new features didn’t require you to do a full-on code and possibly infrastructure deploy? What if you could test on a small subset of users and then roll it back immediately if results aren’t what you expect? LaunchDarkly does exactly this. To learn more, visit launchdarkly.com and tell them Corey sent you, and watch for the wince.

Corey: “Security is Job Zero” according to AWS. Next week I’ll have a fair bit on that I suspect, since this week is re:Invent. Let’s see what happened before the storm hit.



IBM put out its annual Cost of a Data Breach Report which is interesting, but personally I find it genius. This is how you pollute SEO for the search term ‘IBM Data Breach’, which is surely just a matter of time if it hasn’t already happened.



Speaking of, GoDaddy effectively got its ass handed to it in a security breach last week. We found out of course via an SEC filing instead of GoDaddy doing the smart thing and proactively getting in front of it. Apparently they were breached for at least two-and-a-half months, nobody noticed, and 1.2 million people got their admin creds stolen. I can’t stress enough that you should not be doing business with GoDaddy.

And to complete the trifecta, ‘Millions of Brazilians’ is a fun thing to say unless you’re talking about who’s been victimized by an S3 Bucket Negligence Award; then nobody’s having fun at all.

The AWS security blog had a few things to say. “You can now securely connect to your Amazon MSK clusters over the internet.” Wait, what? What the hell was going on before? Were you unable to access the clusters over the internet, or were you able to do so, but only insecurely? This is terrifying framing.



“AWS Security Profiles: Megan O’Neil, Sr. Security Solutions Architect.” I really dig these! The problem is that the AWS security blog only really seems to put these out around major AWS conferences when there’s a bunch of other announcements. I’d love it if more of the AWS blogs would do periodic “The faces, voices, and people that power AWS” profiles because I assure you, most of the people building the magic never take the stage at these conferences.

There was another profile of Merritt Baer, who is a principal in the Office of the CISO, and she’s an absolute delight. One of these days, post-pandemic, we’re going to try and record some kind of video or other, just so we can name it “Quinn and Baer it.”

Corey: This episode is sponsored in part by something new. Cloud Academy is a training platform built on two primary goals: having the highest quality content in tech and cloud skills, and building a good community that is rich and full of IT and engineering professionals. You wouldn’t think those things go together, but sometimes they do. It’s both useful for individuals and large enterprises, but here’s what makes this something new—I don’t use that term lightly—Cloud Academy invites you to showcase just how good your AWS skills are. For the next four weeks, you’ll have a chance to prove yourself. Compete in four unique lab challenges where they’ll be awarding more than $2,000 in cash and prizes. I’m not kidding: first place is a thousand bucks. Pre-register for the first challenge now, one that I picked out myself on Amazon SNS image resizing, by visiting cloudacademy.com/corey—C-O-R-E-Y. That’s cloudacademy.com/corey. We’re going to have some fun with this one.

Corey: And of course, “Macie Classic alerts that derive from AWS CloudTrail global service events for AWS Identity and Access Management (IAM) and AWS Security Token Service (STS) API calls will be retired (no longer generated) in the us-west-2 (Oregon) AWS Region.” See, that’s one of those super important things to know, and I hate how AWS buries it. That said, don’t use Macie Classic because it is horrifyingly expensive compared to modern Macie.

And from the tools and tricks area, I discovered permissions.cloud last week and it’s great. The website uses a variety of information gathered within the IAM dataset and then exposes that information in a clean, easy-to-read format. It’s there to provide an alternate community-driven source of truth for AWS identity. It’s gorgeous as well, so you know it’s not an official AWS product.



And that’s what happened in AWS security. Thank you for listening. I’ll talk to you next week if I survive re:Invent.



Corey: Thank you for listening to the AWS Morning Brief: Security Edition with the latest in AWS security that actually matters. Please follow AWS Morning Brief on Apple Podcast, Spotify, Overcast—or wherever the hell it is you find the dulcet tones of my voice—and be sure to sign up for the Last Week in AWS newsletter at lastweekinaws.com.

Announcer: This has been a HumblePod production. Stay humble.

Thu, 02 Dec 2021 03:00:00 -0800
re:Quinnvent Day 3
AWS Morning Brief for Day 3 of re:Quinnvent on Wednesday, December 1 with Corey Quinn.
Wed, 01 Dec 2021 06:00:00 -0800
Amazon Linux 2022: Codename setenforce 0

Want to give your ears a break and read this as an article? You’re looking for this link. https://www.lastweekinaws.com/blog/amazon-linux-2022-codename-setenforce-0



Never miss an episode



Help the show



What's Corey up to?

Wed, 01 Dec 2021 03:00:00 -0800
re:Quinnvent Day 2
AWS Morning Brief for Day 2 of re:Quinnvent on Tuesday, November 30 with Corey Quinn.
Tue, 30 Nov 2021 06:00:00 -0800
re:Quinnvent Day 1
AWS Morning Brief for Day 1 of re:Quinnvent on Monday, November 29th, 2021 with Corey Quinn.
Mon, 29 Nov 2021 06:07:51 -0800
re:Quinnvent Week
AWS Morning Brief for the week of November 29, 2021 with Corey Quinn.
Mon, 29 Nov 2021 03:00:00 -0800
AWS Security Services Cost More Than The Breach

Links


Transcript

Corey: This is the AWS Morning Brief: Security Edition. AWS is fond of saying security is job zero. That means it’s nobody in particular’s job, which means it falls to the rest of us. Just the news you need to know, none of the fluff.

Corey: Writing ad copy to fit into a 30-second slot is hard, but if anyone can do it the folks at Quali can. Just like their Torque infrastructure automation platform can deliver complex application environments anytime, anywhere, in just seconds instead of hours, days, or weeks. Visit Qtorque.io today, and learn how you can spin up application environments in about the same amount of time it took you to listen to this ad.

Corey: Happy Thanksgiving. Lacework raised an eye-popping $1.3 billion in funding last week. I joke about it being a result of them sponsoring this podcast, for which I thank them, but that’s not the entire story. “Why would someone pay for Lacework when AWS offers a bunch of security services?” is a reasonable question. The answer is that AWS offers a bunch of security services, doesn’t articulate how they all fit together super well, and the cost of running them all on a busy account likely exceeds the cost of a data breach. Security has to be simple to understand. An architecture diagram that looks busier than a London Tube map is absolutely not that. Cloud services are complex, but inside of that complexity lies a lot of room for misconfiguration. Being condescendingly told after the fact about AWS’s Shared Responsibility Model is cold comfort. Vendors who can simplify that story and deliver on that promise stand to win massively here.

Now, let’s see what happened last week. The NSA and CISA have a new set of security guidelines for 5G networks. I’m sorry, but what about this is specific to 5G networks? It’s all about zero trust, assuming that any given node inside the perimeter might be compromised, and the like. None of this is particularly germane to 5G, so I’ve got to ask, what am I missing?

A company called RedDoorz—spelled with a Z, because of course it is—was fined by Singapore’s regulatory authority for leaking 5.9 million records. That’s good. The fine was $54,456 USD, which seems significantly less good? I mean, that’s “Cost of doing business” territory when you’re talking about data breaches. In an ideal world it would hurt a smidgen more as a goad to inspire companies to do better than they are? Am I just a dreamer here?

I found a list of 4 Security Questions to Ask About Your Salesforce Application, and it’s great, and I don’t give a toss about the Salesforce aspect of it. They are, one, who are the users with excessive privileges? Two, what would happen if a legitimate user started acting in a suspicious way? Three, what would happen if a threat actor gained access to sensitive data through a poor third-party integration? And, four, what would happen if your incident log is not properly configured? These are important questions to ask about basically every application in your environment. I promise, you probably won’t like the answers—but attackers ask them constantly. You should, too.



Corey: This episode is sponsored in part by something new. Cloud Academy is a training platform built on two primary goals: having the highest quality content in tech and cloud skills, and building a good community that is rich and full of IT and engineering professionals. You wouldn’t think those things go together, but sometimes they do. It’s both useful for individuals and large enterprises, but here’s what makes this something new—I don’t use that term lightly—Cloud Academy invites you to showcase just how good your AWS skills are. For the next four weeks, you’ll have a chance to prove yourself. Compete in four unique lab challenges where they’ll be awarding more than $2,000 in cash and prizes. I’m not kidding: first place is a thousand bucks. Pre-register for the first challenge now, one that I picked out myself on Amazon SNS image resizing, by visiting cloudacademy.com/corey—C-O-R-E-Y. That’s cloudacademy.com/corey. We’re going to have some fun with this one.

Corey: Now, from the mouth of the AWS horse, there was an interesting article: Managing temporary elevated access to your AWS environment. Now, this post is complicated, but yes, ideally users shouldn’t be using accounts with permissions to destroy production in day-to-day use; more restricted permissions should be used for daily work, and then people elevate to greater permissions only long enough to perform a task that requires them. That’s the Linux ‘sudo’ model. Unfortunately, implementing this is hard and ‘sudo zsh’ is often the only command people ever run from their non-admin accounts.
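For the curious, the mechanical core of that elevation pattern is a single STS call. Here is a minimal sketch; the role ARN and session name are placeholders of mine.

```python
import boto3

sts = boto3.client("sts")

# Trade day-to-day credentials for a privileged session that self-destructs.
elevated = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/BreakGlassAdmin",  # placeholder
    RoleSessionName="prod-incident",                           # placeholder
    DurationSeconds=900,  # the minimum; keep elevation windows short
)["Credentials"]

admin = boto3.client(
    "ec2",
    aws_access_key_id=elevated["AccessKeyId"],
    aws_secret_access_key=elevated["SecretAccessKey"],
    aws_session_token=elevated["SessionToken"],
)
# Do the one scary thing, then let the credentials expire on their own.
```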

And one more. Everything you wanted to know about trusts with AWS Managed Microsoft AD. Look, I don’t touch these things myself basically ever. I haven’t done anything with Active Directory since the mid-naughts, and I don’t want to know anything about them. That said, I do accept that others will care about it and that’s why I mention it. I’m here for you.

And lastly, as far as tools go, have you ever tried to work with CloudTrail logs yourself? Yeah, you might have noticed the experience was complete crap. This is why I talk about trailscraper, which I discovered last week. It makes it way easier to look for specific patterns in your logs, or even just grab the logs in non-compressed format to work with more easily.
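If you'd rather see what the raw experience looks like before reaching for tooling, here is a minimal boto3 sketch of the native LookupEvents API, which only searches 90 days of management events and filters on a single attribute at a time; hence the complaining.

```python
import boto3

cloudtrail = boto3.client("cloudtrail")

# LookupEvents only covers the last 90 days of management events,
# one lookup attribute at a time; hence the cottage industry of better tools.
paginator = cloudtrail.get_paginator("lookup_events")
for page in paginator.paginate(
    LookupAttributes=[
        {"AttributeKey": "EventName", "AttributeValue": "ConsoleLogin"}
    ]
):
    for event in page["Events"]:
        print(event["EventTime"], event.get("Username", "?"), event["EventName"])
```

And that’s what happened last week in the world of AWS security. Next week is re:Invent, and Lord alone knows what nonsense we’re going to uncover then. Strap in, it’s going to be an experience. Thanks for listening.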



Corey: Thank you for listening to the AWS Morning Brief: Security Edition with the latest in AWS security that actually matters. Please follow AWS Morning Brief on Apple Podcast, Spotify, Overcast—or wherever the hell it is you find the dulcet tones of my voice—and be sure to sign up for the Last Week in AWS newsletter at lastweekinaws.com.



Announcer: This has been a HumblePod production. Stay humble.

Thu, 25 Nov 2021 03:00:00 -0800
The AWS Managed NAT Gateway is Unpleasant and Not Recommended

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/The-AWS-Managed-NAT-Gateway-is-Unpleasant-and-Not-Recommended



Never miss an episode



Help the show



What's Corey up to?

Wed, 24 Nov 2021 03:00:00 -0800
Benjamin Button, AWS Monitron Product Manager
AWS Morning Brief for the week of November 22, 2021 with Corey Quinn.
Mon, 22 Nov 2021 03:00:00 -0800
Cloud Security Should Be Boring

Links:


Transcript

Corey: This is the AWS Morning Brief: Security Edition. AWS is fond of saying security is job zero. That means it’s nobody in particular’s job, which means it falls to the rest of us. Just the news you need to know, none of the fluff.

Corey: Writing ad copy to fit into a 30-second slot is hard, but if anyone can do it the folks at Quali can. Just like their Torque infrastructure automation platform can deliver complex application environments anytime, anywhere, in just seconds instead of hours, days, or weeks. Visit Qtorque.io today, and learn how you can spin up application environments in about the same amount of time it took you to listen to this ad.

Corey: As I prepare for re:Quinnvent, I notice that most of the flurry of announcements aren’t centered around security. This is probably for the best; if security becomes too exciting, you might be an Azure customer. Onward.



Let’s dive into what the whole Azure challenge is. The researchers who discovered the CosmosDB vulnerability that Azure suffered back in September have come out with a deeper dive into what they did and how they did it, and it is oh so very much worse than we thought. They were able to get access to the CosmosDB control plane itself.



Microsoft has continued to say nothing about this, in spite of lingering questions such as, “How on earth did you not detect what amounts to a hypervisor escape?” “Holy God, why did you architect these systems without strict tenant isolation in mind since the beginning?” “How are customers supposed to trust anything you’re selling from a security perspective?” And, “What kind of clown shop are you people running over there?”

Separately—and this is kind of amazing—a ransomware hacker gang publicly apologized and removed some of their stolen data because one of their victims was accidentally Mohammed bin Salman. You know, the crown prince of Saudi Arabia who resolves his differences with journalists via hit squads equipped with bone saws. These folks want to do crime, but the right level of crime; you know, the failure mode of, “Being extradited to serve time in a US federal prison,” not, “Being dismembered with a bone saw.”

Corey: This episode is sponsored in part by something new. Cloud Academy is a training platform built on two primary goals: having the highest quality content in tech and cloud skills, and building a good community that is rich and full of IT and engineering professionals. You wouldn’t think those things go together, but sometimes they do. It’s both useful for individuals and large enterprises, but here’s what makes this something new—I don’t use that term lightly—Cloud Academy invites you to showcase just how good your AWS skills are. For the next four weeks, you’ll have a chance to prove yourself. Compete in four unique lab challenges where they’ll be awarding more than $2,000 in cash and prizes. I’m not kidding: first place is a thousand bucks. Pre-register for the first challenge now, one that I picked out myself on Amazon SNS image resizing, by visiting cloudacademy.com/corey—C-O-R-E-Y. That’s cloudacademy.com/corey. We’re going to have some fun with this one!

AWS didn’t include much in the way of interest for security this week, so I’m going to draw your attention to AWS Artifact. It’s not a service in the traditional sense, but rather a no-cost, self-service portal for on-demand access to AWS’ compliance reports, of which there are oh so very many. You used to have to get these one-by-one from your account team under NDA; don’t do that. And for God’s sake don’t write your own. Grab these reports, throw them at your auditor, and get back to doing things that actually appear in your job description instead.

Let’s talk about tools. Policy Sentry came out of Salesforce and is deceptively simple in concept: it makes it way easier to write simple, narrowly scoped IAM policies. This is what the official IAM Access Analyzer wishes it were, but it’s simply not there yet.
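To make that concrete, here is the shape of policy such tooling nudges you toward, created by hand with boto3; the policy name, bucket, and prefix are placeholders of mine rather than anything Policy Sentry emits.

```python
import json

import boto3

iam = boto3.client("iam")

# One action, one resource, one prefix: the opposite of "s3:*" on "*".
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-bucket/reports/*",  # placeholder
        }
    ],
}

iam.create_policy(
    PolicyName="read-reports-only",  # placeholder
    PolicyDocument=json.dumps(policy_document),
)
```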

And it’s also been a while since I dug into Prowler. Prowler is a command-line tool that helps you with AWS security assessment, auditing, hardening, and incident response. Like most things that focus on CIS benchmarks, you’ll need to apply judgement. An awful lot of things that make sense in a responsible, secure environment will still set off alarms from benchmarks that are considerably more naive. And that’s what happened last week in security in the world of AWS. We have an interesting couple of weeks coming ahead. I’ll be talking to you more next week.

Thu, 18 Nov 2021 03:00:00 -0800
My re:Quinnvent Justification Letter 2021

Want to give your ears a break and read this as an article? You’re looking for this link:
https://www.lastweekinaws.com/blog/my-re-quinnvent-justification-letter



Never miss an episode



Help the show



What's Corey up to?

Wed, 17 Nov 2021 03:00:00 -0800
The AWS East West Canada Region
AWS Morning Brief for the week of November 15, 2021 with Corey Quinn.
Mon, 15 Nov 2021 03:00:00 -0800
Stop Embedding Credentials

Links:


Transcript

Corey: This is the AWS Morning Brief: Security Edition. AWS is fond of saying security is job zero. That means it’s nobody in particular’s job, which means it falls to the rest of us. Just the news you need to know, none of the fluff.

Corey: Writing ad copy to fit into a 30-second slot is hard, but if anyone can do it the folks at Quali can. Just like their Torque infrastructure automation platform can deliver complex application environments anytime, anywhere, in just seconds instead of hours, days, or weeks. Visit Qtorque.io today, and learn how you can spin up application environments in about the same amount of time it took you to listen to this ad.

Corey: It’s a pretty quiet week on the AWS security front because I’m studiously ignoring Robinhood’s breach. There’s nothing to see here.

So, Ransomware sucks and it’s getting worse. Kevin Beaumont wrote a disturbing article earlier this summer—that I just stumbled over, so it’s new to me—about how we effectively aren’t prepared for what’s happening in the ransomworld space. It’s a new battle with new rules, and we haven’t seen the worst of it by far. Now look, alarmism is easy to come by, but Kevin is very well respected in this space for a reason; when he speaks, smart people listen.

If you do nothing else for me this week, please, please, please be careful with credentials. Don’t embed them into apps you ship other places; don’t hardcode them into your apps; and ideally, for applications you run on AWS itself, use instance roles, function roles, or whatever flavor of role applies, since those come with ephemeral credentials. Because if you don’t, someone may steal them like they did with Kaspersky’s Amazon SES token and use it for Office365 phishing attacks.
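In boto3 terms, doing it right is actually less code than doing it wrong. A minimal sketch, assuming the code runs on an instance or Lambda with a role attached:

```python
import boto3

# Wrong: long-lived keys baked into the code, waiting to be stolen.
# ses = boto3.client(
#     "ses",
#     aws_access_key_id="AKIA...",
#     aws_secret_access_key="...",
# )

# Right: no credentials in sight. boto3 walks its provider chain and picks
# up the ephemeral, auto-rotated credentials from the attached role.
ses = boto3.client("ses")
```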

And I found analysis that I rather liked about the Twitch breach—although I believe they pronounce it ‘Twetch’. It emphasizes that this stuff is hard, and it talks about the general principles that you should be considering with respect to securing cloud apps. Contrary to the narrative some folks are spinning, Twitch engineers were neither incompetent nor careless, as a general rule.

Corey: This episode is sponsored in part by something new. Cloud Academy is a training platform built on two primary goals: having the highest quality content in tech and cloud skills and building a good community that is rich and full of IT and engineering professionals. You wouldn’t think those things go together, but sometimes they do. It’s both useful for individuals and large enterprises, but here’s what makes this something new—I don’t use that term lightly—Cloud Academy invites you to showcase just how good your AWS skills are. For the next four weeks, you’ll have a chance to prove yourself. Compete in four unique lab challenges where they’ll be awarding more than $2,000 in cash and prizes. I’m not kidding: first place is a thousand bucks. Pre-register for the first challenge now, one that I picked out myself on Amazon SNS image resizing, by visiting cloudacademy.com/corey—C-O-R-E-Y. That’s cloudacademy.com/corey. We’re going to have some fun with this one.

There was an AWS post: Implement OAuth 2.0 device grant flow by using Amazon Cognito and AWS Lambda. Awkward title but I like the principle here. The challenge I have is that Cognito is just. So. Difficult. I don’t think I’m the only person who feels this way.

Objectively, using Cognito is the best sales pitch I can imagine for FusionAuth or Auth0. I’m hoping for a better story at re:Invent this year from the Cognito team, but I’ve been saying that for three years now. The problem with the complexity is that once it’s working—huzzah, at great expense and difficulty—you’ll move on to other things; nobody is going to be able to untangle what you’ve done without at least as much work in the future, should things change. If it isn’t simple, I question its security just due to the risk of misconfiguration.

And this is—I don’t know if this is a tool or a tip; it’s kind of both. If you’re using AWS, which I imagine if you’re listening to this, you probably are, let me draw your attention to Systems Manager Parameter Store. Great service, dumb name. I use it myself constantly for things that are even slightly sensitive. And those things range from usernames to third-party credentials to URL endpoints for various things.
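A minimal sketch of what I mean, with a made-up parameter name; SecureString gets you KMS encryption at rest, and WithDecryption hands the plaintext back at read time:

```python
import boto3

ssm = boto3.client("ssm")

# Store it once, encrypted with the account's default SSM KMS key.
ssm.put_parameter(
    Name="/example/db/password",  # placeholder
    Value="hunter2",
    Type="SecureString",
    Overwrite=True,
)

# Read it back wherever the code runs; no secrets ship with the app.
password = ssm.get_parameter(
    Name="/example/db/password",
    WithDecryption=True,
)["Parameter"]["Value"]
```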



Think of it as a free version of Secrets Manager. The value of that service is that you can run arbitrary code to rotate credentials elsewhere, but it’ll cost you 40¢ per month per secret to use it. Contrast that with Parameter Store, which is free. The security guarantees are the same; don’t view this as being somehow less secure because it’s missing the word ‘secrets’ in its name. Obviously, if you’re using something with a bit more oomph like HashiCorp’s excellent Vault, you can safely ignore everything that I just said. And that’s what happened last week in AWS security. If you’ve enjoyed listening to this, tell everyone you know to listen to it as well. Become an evangelist and annoy the hell out of people, to my benefit. Thanks for listening and I’ll talk to you next week.

Corey: Thank you for listening to the AWS Morning Brief: Security Edition with the latest in AWS security that actually matters. Please follow AWS Morning Brief on Apple Podcast, Spotify, Overcast—or wherever the hell it is you find the dulcet tones of my voice—and be sure to sign up for the Last Week in AWS newsletter at lastweekinaws.com.

Announcer: This has been a HumblePod production. Stay humble.

Thu, 11 Nov 2021 03:00:00 -0800
The Sneaky Weakness Behind AWS’ Managed KMS Keys

Want to give your ears a break and read this as an article? You’re looking for this link.
https://www.lastweekinaws.com/blog/The-Sneaky-Weakness-Behind-AWS'-Managed-KMS-keys



Never miss an episode



Help the show



What's Corey up to?

Wed, 10 Nov 2021 03:00:00 -0800
Amazon Thyme Sync
AWS Morning Brief for the week of November 8, 2021 with Corey Quinn.
Mon, 08 Nov 2021 03:00:00 -0800
Security Awareness Training in Five Minutes

Links:


Transcript

Corey: This is the AWS Morning Brief: Security Edition. AWS is fond of saying security is job zero. That means it’s nobody in particular’s job, which means it falls to the rest of us. Just the news you need to know, none of the fluff.

Corey: This episode is sponsored in part by Liquibase. If you’re anything like me, you’ve screwed up the database part of a deployment so severely that you’ve been banned from ever touching anything that remotely sounds like SQL by at least three different companies. We’ve mostly got code deployment solved for, but when it comes to databases, we basically rely on desperate hope, with a rollback plan of keeping our resumes up to date. It doesn’t have to be that way. Meet Liquibase. It’s both an open-source project and a commercial offering. Liquibase lets you track, modify, and automate database schema changes across almost any database, with guardrails that ensure you’ll still have a company left after you deploy the change. No matter where your database lives, Liquibase can help you solve your database deployment issues. Check them out today at liquibase.com. Offer does not apply to Route 53.



Corey: I’ll be hosting a drinkup-slash-meetup at Optimism Brewery in Seattle tonight at 7 p.m. If you’re in town, stop on by and let me buy you a drink. And of course, re:Quinnvent approaches; if you’re interested in keeping up with what my nonsense looks like, check out requinnvent.com.

Corey: Let’s see what happened in the world of security last week. Lydia Leong of Gartner has been on a tear lately. Her latest, Don’t be surprised when ‘move fast and break things’ results in broken stuff, is an important read. The goal isn’t to slow things down; it’s to build guardrails that mean you can move fast, safely. That’s the goal of security, to provide safety, not impenetrable blockers to getting work done. Forget this at your own peril.

I also wrote my own Security Awareness Training in the form of a Twitter thread. It’s like a normal version except it’s funny. Don’t discount that, though; it’s not a joke. If you make people laugh, you’ve gotten their attention. If you have their attention, then you’ve got a chance to teach them something.

What’d AWS have to say about security last week? Correlate security findings with AWS Security Hub and Amazon EventBridge. So, let me get this straight. AWS sells and charges for Amazon GuardDuty, Amazon Macie, Amazon Inspector, and Amazon Detective, but still wants you to wire stuff together yourself in order to correlate events? How are they so good at the technology bits and so very bad at the ‘tying it all together with a neat presentation’ part?
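To give you a sense of the wiring involved, here is a hedged sketch of the EventBridge half with boto3; the rule name and SNS topic ARN are placeholders of mine:

```python
import json

import boto3

events = boto3.client("events")

# Match imported Security Hub findings at HIGH severity or above.
pattern = {
    "source": ["aws.securityhub"],
    "detail-type": ["Security Hub Findings - Imported"],
    "detail": {
        "findings": {
            "Severity": {"Label": ["HIGH", "CRITICAL"]}
        }
    },
}

events.put_rule(
    Name="security-hub-high-findings",  # placeholder
    EventPattern=json.dumps(pattern),
)

events.put_targets(
    Rule="security-hub-high-findings",
    Targets=[{
        "Id": "notify",
        "Arn": "arn:aws:sns:us-east-1:123456789012:security-alerts",  # placeholder
    }],
)
```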



Corey: This episode is sponsored in part by something new. Cloud Academy is a training platform built on two primary goals: having the highest quality content in tech and cloud skills, and building a good community that is rich and full of IT and engineering professionals. You wouldn’t think those things go together, but sometimes they do. It’s both useful for individuals and large enterprises, but here’s what makes this something new—I don’t use that term lightly—Cloud Academy invites you to showcase just how good your AWS skills are. For the next four weeks, you’ll have a chance to prove yourself. Compete in four unique lab challenges where they’ll be awarding more than $2,000 in cash and prizes. I’m not kidding: first place is a thousand bucks. Pre-register for the first challenge now, one that I picked out myself on Amazon SNS image resizing, by visiting cloudacademy.com/corey—C-O-R-E-Y. That’s cloudacademy.com/corey. We’re going to have some fun with this one.

Three ways to improve your cybersecurity awareness program. It would seem that one of them isn’t, “Google for ‘Azure Security September’ and stand back.” I like the three points—which are: to be sure to articulate personal value, be inclusive, and weave it into workflows—because they’re not technical, they’re psychological. That’s where security, just like cloud economics, starts and stops. It’s people more than it is computers.

And Amazon releases free cybersecurity awareness training. Unfortunately, the transcript is all of 700 words long. This is a problem. Part of the reason you have a program to train staff on cybersecurity awareness is so you can make a good-faith argument that when you inevitably suffer an attack, you’d done all that you could to train folks on proper security behaviors. Unfortunately, a training program that’s made of fewer words than this podcast episode seems unlikely to be convincing.

And now to the tool. Remember when I talked about being able to enumerate roles and account IDs via public calls, but AWS said it wasn’t a problem? Meet Quiet Riot, a tool built to do exactly that in bulk. This is going to be a problem that AWS will have to acknowledge at some point. It’s your move, folks.

An AWS inventory collection tool called aws-recon that focuses on security-relevant metadata is a useful thing to have. The first and surprisingly difficult step of securing a cloud environment is understanding and enumerating what the heck’s running inside of it. I’m astounded that the only first-party answer to this remains ‘the bill.’

And finally, I found a Terraform module that deploys a Lambda to watch CloudTrail and report to Slack—got all that? Good lord—whenever certain things happen. Those things include root logins, console logins without MFA, API calls that failed due to lack of permissions, and more. This might get noisy, but I’d consider deploying at least the big important ones.
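If you’re curious what the guts of something like that look like, here’s a rough sketch in Python; it is not the module’s actual code, and the SLACK_WEBHOOK_URL environment variable and event shape are my own stand-ins for however the real thing is wired up. The Lambda takes a CloudTrail-sourced EventBridge event and posts a one-line summary to a Slack incoming webhook.

import json
import os
import urllib.request

def handler(event, context):
    # EventBridge delivers the matched CloudTrail record under 'detail'
    detail = event.get("detail", {})
    name = detail.get("eventName", "UnknownEvent")
    actor = detail.get("userIdentity", {}).get("arn", "unknown principal")
    payload = {"text": f"CloudTrail alert: {name} by {actor}"}
    request = urllib.request.Request(
        os.environ["SLACK_WEBHOOK_URL"],  # hypothetical configuration
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)  # fire-and-forget; real code wants retries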

And that’s what happened last week in AWS security. I’ll talk to you next week.

Corey: I have been your host, Corey Quinn, and if you remember nothing else, it’s that when you don’t get what you want, you get experience instead. Let my experience guide you with the things you need to know in the AWS security world, so you can get back to doing your actual job. Thank you for listening to the AWS Morning Brief: Security Edition with the latest in AWS security that actually matters. Please follow AWS Morning Brief on Apple Podcasts, Spotify, Overcast—or wherever the hell it is you find the dulcet tones of my voice—and be sure to sign up for the Last Week in AWS newsletter at lastweekinaws.com.



Announcer: This has been a HumblePod production. Stay humble.

Thu, 04 Nov 2021 03:00:00 -0700
The Unfulfilled Promise of Serverless

Want to give your ears a break and read this as an article? You’re looking for this link. https://www.lastweekinaws.com/blog/The-Unfulfilled-Promise-of-Serverless



Never miss an episode



Help the show



What's Corey up to?

Wed, 03 Nov 2021 03:00:00 -0700
The AWS Cwoud Backstowy
AWS Morning Brief for the week of November 1, 2021 with Corey Quinn.
Mon, 01 Nov 2021 07:21:39 -0700
A Secretive Experiment

Links:


Transcript

Corey: This is the AWS Morning Brief: Security Edition. AWS is fond of saying security is job zero. That means it’s nobody in particular’s job, which means it falls to the rest of us. Just the news you need to know, none of the fluff.

Corey: This episode is sponsored in part by Liquibase. If you’re anything like me, you’ve screwed up the database part of a deployment so severely that you’ve been banned from ever touching anything that remotely sounds like SQL at at least three different companies. We’ve mostly got code deployment solved for, but when it comes to databases, we basically rely on desperate hope, with a rollback plan of keeping our resumes up to date. It doesn’t have to be that way. Meet Liquibase. It’s both an open-source project and a commercial offering. Liquibase lets you track, modify, and automate database schema changes across almost any database, with guardrails that ensure you’ll still have a company left after you deploy the change. No matter where your database lives, Liquibase can help you solve your database deployment issues. Check them out today at liquibase.com. Offer does not apply to Route 53.

Corey: So, it’s been an interesting week in the world of AWS security, and a light one. And that’s okay. 1Password introduced 1Password University, and I’m interested in it, not because I expect to learn a whole lot that I didn’t know before about security, but because this might be able to replace my current, fairly awful Security Awareness Training.

See, a lot of companies have contractual requirements to provide SAT to their staff and contractors. Most of them are terrible courses that actively push crap advice like, “Rotate your password every 60 days.” This has the potential, just based on my experiences with 1Password, to be way better than that. But we’ll see.

“Things are different in the cloud,” is something of a truism, and that applies as much to penetration testing as anything else. Understand that your provider may have no sense of humor whatsoever around this, and thus may require you to communicate with them in advance, for example. There was a great interview with Josh Stella, whom I’ve had on Screaming in the Cloud. He’s CEO of Fugue—which he will say is pronounced ‘Fugue’, but it’s ‘Fwage’—and he opined on this with quite some eloquence in an interview I discovered. I should really track him down and see if I can get him back on the podcast one of these days. It has been far too long.

Now, from the mouth of AWS Horse. There’s a new AWS workbook for New Zealand financial services customers, and that honestly kind of harkens back to school: unnecessary work that you’re paying for the privilege of completing. But it is good to be able to sit down and work through the things you’re going to need to be able to answer in a world of cloud when you’re in a regulated industry like that, and those regulations vary from country to country. You can tell where the regulations around data residency are getting increasingly tight because that’s where AWS is announcing regions.

Corey: This episode is sponsored in part by something new. Cloud Academy is a training platform built on two primary goals: having the highest quality content in tech and cloud skills, and building a good community that is rich and full of IT and engineering professionals. You wouldn’t think those things go together, but sometimes they do. It’s both useful for individuals and large enterprises, but here’s what makes this something new—I don’t use that term lightly—Cloud Academy invites you to showcase just how good your AWS skills are. For the next four weeks, you’ll have a chance to prove yourself. Compete in four unique lab challenges where they’ll be awarding more than $2,000 in cash and prizes. I’m not kidding: first place is a thousand bucks. Pre-register for the first challenge now, one that I picked out myself on Amazon SNS image resizing, by visiting cloudacademy.com/corey—C-O-R-E-Y. That’s cloudacademy.com/corey. We’re going to have some fun with this one.

Corey: And of course, a tool for the week. I’ll be playing around with Secretive in the next week or two. It’s an open-source project that stores SSH keys in a Mac’s Secure Enclave instead of on disk. I don’t love the idea of having my key material on disk at all, even though I do passphrase-protect it.

This stores it in the Mac Secure Enclave and presents it well. I’ve had a couple of problems on a couple of machines so far, and I’m talking to the developer in a GitHub issue, but it is important to think about these things. I, of course, turn on full-disk encryption, but if something winds up subverting my machine, I don’t want it to just be able to look at what’s on disk and get access to things that matter. That feels like it could blow up in my face.

Corey: And that’s really what happened last week in AWS security. It’s been a light week; I hope you enjoyed it. There is much more to come next week, now that I’m back from vacation.



Corey: I have been your host, Corey Quinn, and if you remember nothing else, it’s that when you don’t get what you want, you get experience instead. Let my experience guide you with the things you need to know in the AWS security world, so you can get back to doing your actual job. Thank you for listening to the AWS Morning Brief: Security Edition with the latest in AWS security that actually matters. Please follow AWS Morning Brief on Apple Podcasts, Spotify, Overcast—or wherever the hell it is you find the dulcet tones of my voice—and be sure to sign up for the Last Week in AWS newsletter at lastweekinaws.com.

Announcer: This has been a HumblePod production. Stay humble.

Thu, 28 Oct 2021 03:00:00 -0700
The Dumbest Dollars a Cloud Provider Can Make

Want to give your ears a break and read this as an article? You’re looking for this link: http://www.lastweekinaws.com/blog/the-dumbest-dollars-a-cloud-provider-can-make



Never miss an episode



Help the show



What's Corey up to?

Wed, 27 Oct 2021 03:00:00 -0700
Chime SDK Background Bling
AWS Morning Brief for the week of October 25, 2021 with Corey Quinn.
Mon, 25 Oct 2021 03:00:00 -0700
AWS W(T)AF

Links:


Transcript

Corey: This is the AWS Morning Brief: Security Edition. AWS is fond of saying security is job zero. That means it’s nobody in particular’s job, which means it falls to the rest of us. Just the news you need to know, none of the fluff.

Corey: This episode is sponsored in part by Honeycomb. When production is running slow, it’s hard to know where problems originate. Is it your application code, users, or the underlying systems? I’ve got five bucks on DNS, personally. Why scroll through endless dashboards while dealing with alert floods, going from tool to tool to tool that you employ, guessing at which puzzle pieces matter? Context switching and tool sprawl are slowly killing both your team and your business. You should care more about one of those than the other; which one is up to you. Drop the separate pillars and enter a world of getting one unified understanding of the one thing driving your business: production. With Honeycomb, you guess less and know more. Try it for free at honeycomb.io/screaminginthecloud. Observability: it’s more than just hipster monitoring.

Corey: I must confess, I didn’t expect to see an unpatched AWS vulnerability being fodder for this podcast so early in the security lifespan here, but okay. Yes, yes, before I get letters, it’s not a vulnerability as AWS would define it, but it’s a pretty crappy default that charges customers money while giving them a false sense of security.



Past that, it’s going to be a short podcast this week, and that’s just fine by me because the point of it is, “The things you should know as someone who has to care about security.” On slow news weeks like last week, that means I’m not here to give you pointless filler. Onward.

Now, AWS WAF is expensive and apparently, as configured by default, entirely optional for attackers. Only the first 8KB of a request are inspected by default. That means that any malicious payload that starts after the 8KB limit in a POST request will completely bypass AWS WAF unless you’ve explicitly added a rule to block any POST request greater than 8KB in size, which you almost assuredly have not done. Even their managed rule that addresses size limits only kicks in at 10KB. This is—as the kids say—less than ideal.
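Closing that hole is one rule. Here’s a hedged sketch of such a WAFv2 rule, written as the dictionary you’d slot into your web ACL’s rules list via boto3; the rule and metric names are mine, not anything AWS ships.

oversized_body_rule = {
    "Name": "block-oversized-bodies",  # illustrative name
    "Priority": 0,
    "Statement": {
        "SizeConstraintStatement": {
            "FieldToMatch": {"Body": {}},
            "ComparisonOperator": "GT",
            "Size": 8192,  # WAF only inspects the first 8KB of the body
            "TextTransformations": [{"Priority": 0, "Type": "NONE"}],
        }
    },
    "Action": {"Block": {}},
    "VisibilityConfig": {
        "SampledRequestsEnabled": True,
        "CloudWatchMetricsEnabled": True,
        "MetricName": "BlockOversizedBodies",
    },
}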



I had a tweet recently that talked about the horror of us-east-1 being globally unavailable for ages. Tim Bray took this and ran with the horrifying concept in a post he called, “Worst Case.” It’s really worth considering things like this when it comes to disaster and continuity planning. How resilient are our apps and infrastructure really when all is said and done? What dependencies do we take on third parties who in turn rely on the same infrastructure that we’re trying to guard against failure from?

An unfortunate reality is that many cybersecurity researchers don’t have much in the way of legal protections; some folks are looking to change that through legislation. Here’s some good advice: if a security researcher reports a vulnerability to you or your company in good faith, perhaps not acting like a raging jackhole is an option that’s on the table. Bug bounties are hilariously small; they could make many times as much money by selling vulnerabilities to the highest bidder. Instead they’re reporting bugs to you in good faith. Word spreads. If you’re a hassle to deal with, other researchers won’t report things to you in the future. “Be a nice person,” is surprisingly undervalued when it comes to keeping yourself and your company out of trouble.

Now, only one interesting thing came out of the mouth of AWS horse last week in a security context, and it’s a Core Principles whitepaper: “Introducing Security at the Edge.” Setting aside entirely the fact that neither contributor to this has the job title of “EdgeLord,” I like it. Rather than focusing on specific services—although of course there’s some of that because vendors are going to vendor—it emphasizes how to think about the various considerations of edge locations that aren’t deep within hardened data centers. “How should I think about this problem,” is the kind of question that really deserves to be asked a lot more than it is.

And lastly, let’s end up with a tip of the week. If you have a multi-cloud anything, ensure that credentials are not shared between two cloud providers. I’m talking about passwords, keys, et cetera. This is a step beyond the standard password reuse warning of not using the same password for multiple accounts. Think it through; if one of your providers happens to be Azure, and they Azure up the security yet again, you really don’t want that to grant an attacker or other random Azure customers access to your AWS account as well, do you? I thought not.



Corey: This episode is sponsored in part by Liquibase. If you’re anything like me, you’ve screwed up the database part of a deployment so severely that you’ve been banned from ever touching anything that remotely sounds like SQL at at least three different companies. We’ve mostly got code deployment solved for, but when it comes to databases, we basically rely on desperate hope, with a rollback plan of keeping our resumes up to date. It doesn’t have to be that way. Meet Liquibase. It’s both an open-source project and a commercial offering. Liquibase lets you track, modify, and automate database schema changes across almost any database, with guardrails that ensure you’ll still have a company left after you deploy the change. No matter where your database lives, Liquibase can help you solve your database deployment issues. Check them out today at liquibase.com. Offer does not apply to Route 53.



Corey: And that is what happened last week in AWS security. I have been your host, Corey Quinn, and if you remember nothing else, it’s that when you don’t get what you want, you get experience instead. Let my experience guide you with the things you need to know in the AWS security world, so you can get back to doing your actual job. Thank you for listening to the AWS Morning Brief: Security Edition with the latest in AWS security that actually matters. Please follow AWS Morning Brief on Apple Podcasts, Spotify, Overcast—or wherever the hell it is you find the dulcet tones of my voice—and be sure to sign up for the Last Week in AWS newsletter at lastweekinaws.com.

Announcer: This has been a HumblePod production. Stay humble.

Thu, 21 Oct 2021 03:00:00 -0700
The Turbotax of AWS Billing

Want to give your ears a break and read this as an article? You’re looking for this link. https://www.lastweekinaws.com/blog/The-Turbotax-of-AWS-Billing



Never miss an episode



Help the show



What's Corey up to?

Wed, 20 Oct 2021 03:00:00 -0700
AWS Butt Computing
AWS Morning Brief for the week of October 18, 2021 with Corey Quinn.
Mon, 18 Oct 2021 03:00:00 -0700
AWS Security is Twitching

Links:


Transcript

Corey: This is the AWS Morning Brief: Security Edition. AWS is fond of saying security is job zero. That means it’s nobody in particular’s job, which means it falls to the rest of us. Just the news you need to know, none of the fluff.

Corey: This episode is sponsored in part by Thinkst Canary. This might take a little bit to explain, so bear with me. I linked against an early version of their tool, canarytokens.org, in the very early days of my newsletter, and what it does is relatively simple and straightforward. It winds up embedding credentials, files, or anything else like that that you can generate in various parts of your environment, wherever you want them to live. It gives you fake AWS API credentials, for example, and the only thing that these things do is alert you whenever someone attempts to use them. It’s an awesome approach to detecting breaches. I’ve used something similar for years myself before I found them. Check them out. But wait, there’s more because they also have an enterprise option that you should be very much aware of: canary.tools. You can take a look at this, but what it does is it provides an enterprise approach to drive these things throughout your entire environment and manage them centrally. You can get a physical device that hangs out on your network and impersonates whatever you want to. When it gets Nmap scanned, or someone attempts to log into it, or access files that it presents on a fake file store, you get instant alerts. It’s awesome. If you don’t do something like this, instead you’re likely to find out that you’ve gotten breached the very hard way. So, check it out. It’s one of those few things that I look at and say, “Wow, that is an amazing idea. I am so glad I found them. I love it.” Again, those URLs are canarytokens.org and canary.tools. And the first one is free because of course it is. The second one is enterprise-y. You’ll know which one of those you fall into. Take a look. I’m a big fan. More to come from Thinkst Canary in the weeks ahead.

Corey: To begin with, the big news is that this week is the week of the year in which the Last Week in AWS charity shirt is available for sale. All proceeds to benefit 826 National. To get your snarky, sarcastic shirt, “The AWS Status Page,” this year, visit lastweekinaws.com/charityshirt and thank you in advance for your support.

Now, last week’s big security news was about Amazon’s subsidiary, Twitch—or Twetch, depending upon pronunciation. It had a bunch of its code repos and streamer payouts leaked. Given that they are, in fact, an Amazon company largely hosted on AWS—you know, except for the streaming parts; are you a lunatic? That would cost ALL the money—this makes it tricky for AWS to message this as not their problem as per their vaunted Shared Responsibility Model. What’s the takeaway? Too soon to say but, ouch.

From the community. Telegram offered a researcher a €1,000 bounty, which is just insultingly small. The researcher said, “Not so much,” and disclosed a nasty auto-delete bug. If you’re going to run a bug bounty program, ensure that you’re paying researchers enough money to incentivize them to come forward and deal with your no-doubt obnoxious disclosure process.

You can expect a whole bunch of people who don’t care about security to suddenly be asking fun questions as Google prepares to enroll basically all of its users into two-factor auth. Good move, but heads up, support folks.

I found a detailed analysis of AWS account assessment tools. These are tools like CloudSploit (which I’ll talk about in a bit), IAM Vulnerable, et cetera. Fundamentally, they all look at slightly different things; they’re also all largely the same, but it might be worth taking a look.

AWS has made statements indicating that they don’t believe that enumerating which IAM accounts exist in a given AWS account is a security risk, so someone has put out a great technique you can use to enumerate those yourself. Why not, since Amazon doesn’t find this to be a problem?
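The gist of the trick, sketched loosely below: AWS validates every principal named in a resource policy at attach time, so attaching a policy to a bucket you own tells you whether a candidate principal exists without ever touching the target account. The bucket name here is made up, and the exact error code can vary by service.

import json
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def principal_exists(arn, bucket="my-own-scratch-bucket"):
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Deny",
            "Principal": {"AWS": arn},  # the candidate to test
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{bucket}/enumeration-canary",
        }],
    }
    try:
        # Succeeds only if AWS can resolve the principal; clean up after.
        s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "MalformedPolicy":
            return False  # AWS refused to resolve the principal
        raise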

A reference to the various kinds of AWS Access Keys is also something I found relatively handy because I hadn’t seen this ever explained before. It taught me a lot about the different kinds of key nonsense that I encounter in the wild from time to time. Take a look, it’s worth the read.

It didn’t get a lot of attention in the press due to, you know, things last week, but a company that routes billions of text messages said that it was hacked. It’s worth pointing out that SMS is a garbage second factor, just because of how lax the security around it is. I’m a big believer in hardware keys like YubiKeys for important stuff, and an app like Authy or Google Authenticator for less important or shared accounts.

I know, you shouldn’t be sharing accounts; as soon as you come up with a better way for multiple people in different locations to do things that require root credentials in an AWS account, do let me know. Back to my point: treat SMS as a second factor as merely better than nothing, not a serious security bulwark when it matters.

Three things came out from the mouth of AWS horse last week. “Enabling Data Classification for Amazon RDS database with Amazon Macie.” While the idea of streaming from a relational database through a bunch of wildly expensive AWS services is of course ludicrous, the actual value of knowing what the data classification in your database is can’t be overstated.

The best practice pattern here is to make sure that you’re bounding the truly sensitive stuff to its own location. For instance, instead of storing credit card information in ‘the database,’ have a token that references a completely separate database that contains that information and is severely locked down; that way any random business query doesn’t return sensitive data, and you can restrict access to that data to only the queries or groups or situations that require it. Note that this is only an example and you should not in fact be storing credit card numbers yourself. Good God.
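To make the shape of that pattern concrete, here’s a toy sketch; the two dictionaries stand in for ‘the database’ and the separate, severely locked-down vault, and none of it is remotely production-grade.

import uuid

vault = {}   # stands in for the locked-down store almost nothing can read
app_db = {}  # stands in for 'the database' everything else queries

def store_payment_method(customer_id, sensitive_blob):
    token = str(uuid.uuid4())
    vault[token] = sensitive_blob  # only the vault holds the real data
    app_db[customer_id] = {"payment_token": token}  # the app holds a pointer
    return token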

Announcer: Have you implemented industry best practices for securely accessing SSH servers, databases, or Kubernetes? It takes time and expertise to set up. Teleport makes it easy. It is an identity-aware access proxy that brings automatically expiring credentials for everything you need, including role-based access controls, access requests, and the audit log. It helps prevent data exfiltration and helps implement PCI and FedRAMP compliance. And best of all, Teleport is open-source and a pleasure to use. Download Teleport at goteleport.com. That’s goteleport.com.

Corey: “How to set up a two-way integration between AWS Security Hub and Jira Service Management.” Now, I’m not a big fan of either Jira or Security Hub, but integrating whatever it is that finds alerts into something that reports them to someone empowered to do something about them is kind of important. You’ve got to tune it, though. “Someone visited your website,” showing up 3000 times in an hour is going to be very noisy, and mask alerts of the form, “Your database is open to the world.”

They also talk about how to “Update the alternate security contact across your AWS accounts for timely security notifications.” You definitely want to ensure that every AWS account in your cloud estate has the right addresses here configured, and hope that someone who’s compromised your accounts doesn’t use this API to simply change them back again. It’ll stop you from doing that, right? Right? Hello?
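Doing this across an organization is mercifully little code. A hedged boto3 sketch follows; it assumes you’re running from the management account with trusted access enabled for the Account Management API, it glosses over wrinkles like the management account itself, and the contact details are obviously placeholders.

import boto3

org = boto3.client("organizations")
account_api = boto3.client("account")

for page in org.get_paginator("list_accounts").paginate():
    for acct in page["Accounts"]:
        account_api.put_alternate_contact(
            AccountId=acct["Id"],
            AlternateContactType="SECURITY",
            Name="Security Team",
            Title="Security Contact",
            EmailAddress="security-alerts@example.com",
            PhoneNumber="+1-555-0100",
        )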

And finally, Metasploit is famous as an exploitation toolkit for systems. CloudSploit is attempting to be the same thing, only for cloud accounts. It’s not something you’ll likely use day-to-day, but it is a great way to spend an afternoon tinkering while also learning new things. And that’s what happened Last Week in AWS: Security. Thank you for listening and once again, I ask you, go ahead and visit lastweekinaws.com/charityshirt and get yours today.

Corey: I have been your host, Corey Quinn, and if you remember nothing else, it’s that when you don’t get what you want, you get experience instead. Let my experience guide you with the things you need to know in the AWS security world, so you can get back to doing your actual job. Thank you for listening to the AWS Morning Brief: Security Edition.

Thu, 14 Oct 2021 03:00:00 -0700
Why I Turned Down an AWS Job Offer Revisited

Want to give your ears a break and read this as an article? You’re looking for this link. https://www.lastweekinaws.com/blog/why-i-turned-down-an-aws-job-offer



Never miss an episode



Help the show



What's Corey up to?

Wed, 13 Oct 2021 03:00:00 -0700
Charity T-Shirt Week
AWS Morning Brief for the week of October 11, 2021 with Corey Quinn.
Mon, 11 Oct 2021 03:00:00 -0700
DNSSEC Inspired Outages

Links:

Transcript
Corey: This is the AWS Morning Brief: Security Edition. AWS is fond of saying security is job zero. That means it’s nobody in particular’s job, which means it falls to the rest of us. Just the news you need to know, none of the fluff.

Corey: This episode is sponsored in part by Thinkst Canary. This might take a little bit to explain, so bear with me. I linked against an early version of their tool, canarytokens.org, in the very early days of my newsletter, and what it does is relatively simple and straightforward. It winds up embedding credentials, files, or anything else like that that you can generate in various parts of your environment, wherever you want them to live. It gives you fake AWS API credentials, for example, and the only thing that these things do is alert you whenever someone attempts to use them. It’s an awesome approach to detecting breaches. I’ve used something similar for years myself before I found them. Check them out. But wait, there’s more because they also have an enterprise option that you should be very much aware of: canary.tools. You can take a look at this, but what it does is it provides an enterprise approach to drive these things throughout your entire environment and manage them centrally. You can get a physical device that hangs out on your network and impersonates whatever you want to. When it gets Nmap scanned, or someone attempts to log into it, or access files that it presents on a fake file store, you get instant alerts. It’s awesome. If you don’t do something like this, instead you’re likely to find out that you’ve gotten breached the very hard way. So, check it out. It’s one of those few things that I look at and say, “Wow, that is an amazing idea. I am so glad I found them. I love it.” Again, those URLs are canarytokens.org and canary.tools. And the first one is free because of course it is. The second one is enterprise-y. You’ll know which one of those you fall into. Take a look. I’m a big fan. More to come from Thinkst Canary in the weeks ahead.

Corey: Somehow we made it through an entire week without a major vendor having a headline-level security breach. You know, I could get used to this; I’ll take, “It’s harder for me to figure out what to talk about here,” over, “A bunch of customers are scrambling because their providers have failed them,” every time.



So, let’s see what the community had to say. Last week, as you’re probably aware, Let’s Encrypt’s root certificate expired, which caused pain for a bunch of folks. Any device or configuration that hadn’t been updated for a few years is potentially going to see things breaking. The lesson here is to be aware that certificates do expire. The antipattern is to do super-long registrations for things, but that just makes it worse.



One of the things Let’s Encrypt got very right is forcing 90-day rotations for its certificates. When you’ve got to do that every three months, you know where all of your certificates are. If you’ve got to replace it once every ten years, you’ll have no clue; that was six employees ago.
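If you want a cheap starting point on the ‘know where your certificates are’ front, a check along these lines (the hostname is a placeholder; point it at your own endpoints) reports how many days a cert has left.

import datetime
import socket
import ssl

def days_until_expiry(hostname, port=443):
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port)) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    # 'notAfter' is the expiry timestamp baked into the certificate
    expires = ssl.cert_time_to_seconds(cert["notAfter"])
    return (expires - datetime.datetime.now().timestamp()) / 86400

print(days_until_expiry("example.com"))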



In bad week news, Slack was bitten by DNSSEC when they attempted and failed to roll it out. DNSSEC is a bag of pain it’s best not to bother with, as a general rule. DNS is always a bag of pain because of caching and TTL issues. In effect, Slack tried to roll out DNSSEC—probably due to a demand by some big corporate customer—had it fail, panicked and rolled back the change, and was in turn bitten by outages as a bunch of DNS resolvers had the DS key cached, but the authoritative nameservers stopped publishing it. This is a mess and a great warning to those of us who might naively assume that anything like DNSSEC that offers improved security comes without severe tradeoffs. Measure twice, cut once because mistakes are going to show.

I also found a somewhat alarmist article talking about cybersecurity assessments from your customers. Alarmist or not, it brings up a good point. If you’re somehow responsible for security but don’t have security in your job title—which, you know, is exactly who this show is aimed at—you may one day be surprised to have someone from sales pop up and ask you to fill out a form from a prospective customer. Ignore the alarm and the panic, but you’re going to want to move toward something approaching standardization around how you handle those.

The first time you get one of these, it’s a novel exercise; by the tenth, you just want to have a prepared statement you can hand them so you can move on with things. Well, those prepared statements are often called things like, “SOC 2 certifications.” There’s a spectrum and where you fall on it depends upon who you work for and what you do. So, take them seriously and don’t be surprised when you get one.

AWS had a few interesting security-related announcements. AWS Lambda now supports triggering Lambda functions from an Amazon SQS queue in a different account. That doesn’t sound like a security announcement, so why am I talking about it? Because until recently, it wasn’t possible so a lot of folks scoped their IAM policies very broadly; what do you care if any random SQS queue in your own account can invoke a Lambda? With this change, suddenly internet randos can invoke Lambda functions, and you should probably go check production immediately.
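The fix is boring: scope the statement to the specific queue you mean. A sketch of what the tightened execution-role statement might look like, with a made-up queue ARN (these three actions are what Lambda’s SQS polling needs):

scoped_sqs_statement = {
    "Effect": "Allow",
    "Action": [
        "sqs:ReceiveMessage",
        "sqs:DeleteMessage",
        "sqs:GetQueueAttributes",
    ],
    # One specific queue, not "Resource": "*", which is the broad scoping
    # that this change just turned into a cross-account invitation.
    "Resource": "arn:aws:sqs:us-east-1:123456789012:orders-queue",
}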

Announcer: Have you implemented industry best practices for securely accessing SSH servers, databases, or Kubernetes? It takes time and expertise to set up. Teleport makes it easy. It is an identity-aware access proxy that brings automatically expiring credentials for everything you need, including role-based access controls, access requests, and the audit log. It helps prevent data exfiltration and helps implement PCI and FedRAMP compliance. And best of all, Teleport is open-source and a pleasure to use. Download Teleport at goteleport.com. That’s goteleport.com.

Corey: Migrating custom Landing Zone with RAM to AWS Control Tower. It’s worth considering the concept here because, “Using the polished thing” is usually better than building and then maintaining something yourself. You wind up off in the wilderness; then AWS shows up and acts befuddled, “Why on earth would you build things the way that we told you to build them at the time you set up your environment?” It’s obnoxious and they need to stop talking and own their mistakes, but keeping things current with the accepted way of doing things is usually worth at least considering.

AWS has a whitepaper on Ransomware Risk Management out and I’m honestly conflicted about it. There are gems but it talks about a pile of different services they offer to offset the risk. Some of them—like AWS Backup—are great.

Others—“Use Systems Manager State Manager”—present as product pitches for products of varying quality and low adoption. On balance, it’s worth reading, but retain a healthy skepticism if you do. It should be noted that the points they address and the framework they lay out are exactly how risk management folks think, and that’s helpful.



Validate IAM policies in CloudFormation templates using IAM Access Analyzer. I like that one quite a bit. It does what it says on the tin, and applies a bunch of more advanced linting rules than you’d find in something like cfn-lint.



Note that this costs nothing, for a change, even though it does communicate with AWS to run its analysis. As AWS improves the Access Analyzer, findings will likely change, so be aware that this may well result in a regression should you have it installed as part of a CI/CD pipeline.
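Under the hood this rides on the ValidatePolicy API, which you can also call directly if you’d rather wire it up yourself. A minimal boto3 sketch, with a deliberately sloppy example policy of my own:

import json
import boto3

analyzer = boto3.client("accessanalyzer")

sloppy_policy = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "s3:*", "Resource": "*"}],
}

response = analyzer.validate_policy(
    policyDocument=json.dumps(sloppy_policy),
    policyType="IDENTITY_POLICY",
)
for finding in response["findings"]:
    print(finding["findingType"], finding["issueCode"])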

And as far as tools go, if you’re not a security researcher, good; you’re in the right place. But that said, if you have a spare afternoon at some point, you may want to check out Pacu—that’s P-A-C-U. It’s an open-source AWS exploitation framework that lets you see just how insecure your AWS accounts might be. I generally leave playing with those sorts of things to security professionals, but this is a fun way to just take a quick check and see if there’s a burning fire that jumps out that might arise for you down the road. And I’ll talk to you more about all this stuff next week.

Corey: I have been your host, Corey Quinn, and if you remember nothing else, it’s that when you don’t get what you want, you get experience instead. Let my experience guide you with the things you need to know in the AWS security world, so you can get back to doing your actual job. Thank you for listening to the AWS Morning Brief: Security Edition.

Announcer: This has been a HumblePod production. Stay humble.

Thu, 07 Oct 2021 03:00:00 -0700
The Compelling Economics of Cloudflare R2

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/The-Compelling-Economics-of-Cloudflare-R2



Never miss an episode



Help the show



What's Corey up to?

Wed, 06 Oct 2021 03:00:00 -0700
Cloudflare's Object Storage Lesson
AWS Morning Brief for the week of October 4, 2021 with Corey Quinn.
Mon, 04 Oct 2021 03:00:00 -0700
F5's Refreshing Culture

Links:


Transcript

Corey: This is the AWS Morning Brief: Security Edition. AWS is fond of saying security is job zero. That means it’s nobody in particular’s job, which means it falls to the rest of us. Just the news you need to know, none of the fluff.

Corey: This episode is sponsored in part by Thinkst Canary. This might take a little bit to explain, so bear with me. I linked against an early version of their tool, canarytokens.org, in the very early days of my newsletter, and what it does is relatively simple and straightforward. It winds up embedding credentials, files, or anything else like that that you can generate in various parts of your environment, wherever you want them to live. It gives you fake AWS API credentials, for example, and the only thing that these things do is alert you whenever someone attempts to use them. It’s an awesome approach to detecting breaches. I’ve used something similar for years myself before I found them. Check them out. But wait, there’s more because they also have an enterprise option that you should be very much aware of: canary.tools. You can take a look at this, but what it does is it provides an enterprise approach to drive these things throughout your entire environment and manage them centrally. You can get a physical device that hangs out on your network and impersonates whatever you want to. When it gets Nmap scanned, or someone attempts to log into it, or access files that it presents on a fake file store, you get instant alerts. It’s awesome. If you don’t do something like this, instead you’re likely to find out that you’ve gotten breached the very hard way. So, check it out. It’s one of those few things that I look at and say, “Wow, that is an amazing idea. I am so glad I found them. I love it.” Again, those URLs are canarytokens.org and canary.tools. And the first one is free because of course it is. The second one is enterprise-y. You’ll know which one of those you fall into. Take a look. I’m a big fan. More to come from Thinkst Canary in the weeks ahead.

Corey: This podcast seems to be going well. The Meanwhile in Security podcast has been fully rolled over and people are chiming in with kind things, which kind of makes me wonder, is this really a security podcast? Because normally people in that industry are mean.



Let’s dive into it. What happened last week in security? Touching AWS, Ben Kehoe is on a security roll lately. The title of his article in full reads, “I Trust AWS IAM to Secure My Applications. I Don’t Trust the IAM Docs to Tell Me How”, and I think he’s put his finger on the pulse of something that’s really bothered me for a long time. IAM feels arcane and confusing. The official docs just make that worse for me. My default is assuming that the problem is entirely with me, but that’s not true at all. I suspect I’m very far from the only person out there who feels this way.

An “Introduction to Zero Trust on AWS ECS Fargate” is well-timed. Originally when Fargate launched, the concern was zero trust of AWS ECS Fargate, but we’re fortunately past that now. The article is lengthy and isn’t super clear as to the outcome it’s driving for, and it also forgets that SSO is for humans and not computers, but it’s well documented and it offers plenty of code to implement such a thing yourself. It’s time to move beyond static IAM roles for everything.

Threat Stack has been a staple of the Boston IT scene for years; they were apparently acquired by F5 for less money than they’d raised, which seems unfortunate. I’m eager to see how they find F5’s culture. I bet it’s refreshing.



And, jealous of the attention Azure has gotten in the past few episodes of this podcast, VMware wishes to participate by including a critical-severity flaw that enables ransomware in vCenter or vSphere. I can’t find anything that indicates whether or not VMware on AWS is affected, so those of you running that thing should probably validate that everything’s patched. Reach out to your account manager, with whom, if you’re running something like that, you should be in close contact anyway.

Corey: Now, from AWS themselves, what do they have to say? Not much last week on the security front; their blog was suspiciously silent. Scuttlebutt on Twitter has it that they’re attempting to get themselves removed from an exploit, CVE-2021-38112, which is a remote code execution vulnerability. If you have the Amazon WorkSpaces client installed, update it, because a malicious URL could cause code to be executed on the client’s machine. It’s been patched, but I think AWS likes not having public pointers to past security lapses lurking around. I don’t blame them; I mean, who wants that? The reason I bring it up is not to shame them for it, but to highlight that all systems have faults in them. AWS is not immune to security problems, nor is any provider. It’s important, to my mind, to laud companies for rapid remediation and disclosure and to try not to shame them for having bugs in the first place. I don’t always succeed at it, but I do try. But heaven help you if you try to blame an intern for a security failure.

And instead of talking about a tool, let’s do a tip of the week. Ransomware is in the news a lot, but so far, all that I’ve seen with regard to ransomware that encrypts the contents of S3 buckets is theoretical proofs—or proves—of concept. That said, for the data you can’t afford to lose, you’ve got a few options that stack together neatly. The approach distills down to some combination of enabling MFA Delete, enabling versioning on the bucket, and setting up replication rules to environments that are controlled by different credential sets entirely. This will of course become both maintenance-intensive and extremely expensive for some workloads, but it’s always a good idea to periodically review your use of S3 and back up the truly important things.
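As a rough sketch of two of those layers (MFA Delete is left out because it requires the root user’s MFA device, and every name below is made up): enable versioning, then replicate to a bucket owned by a different account under entirely different credentials.

import boto3

s3 = boto3.client("s3")

s3.put_bucket_versioning(
    Bucket="prod-data",
    VersioningConfiguration={"Status": "Enabled"},
)

s3.put_bucket_replication(
    Bucket="prod-data",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/replication-role",
        "Rules": [{
            "Status": "Enabled",
            "Priority": 1,
            "Filter": {},
            "DeleteMarkerReplication": {"Status": "Disabled"},
            "Destination": {
                "Bucket": "arn:aws:s3:::backup-vault-separate-account",
                "Account": "210987654321",
                # Hand replica ownership to the destination account so
                # compromised source credentials can't reach the copies.
                "AccessControlTranslation": {"Owner": "Destination"},
            },
        }],
    },
)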

Announcer: Have you implemented industry best practices for securely accessing SSH servers, databases, or Kubernetes? It takes time and expertise to set up. Teleport makes it easy. It is an identity-aware access proxy that brings automatically expiring credentials for everything you need, including role-based access controls, access requests, and the audit log. It helps prevent data exfiltration and helps implement PCI and FedRAMP compliance. And best of all, Teleport is open-source and a pleasure to use. Download Teleport at goteleport.com. That’s goteleport.com.



Corey: I have been your host, Corey Quinn, and if you remember nothing else, it’s that when you don’t get what you want, you get experience instead. Let my experience guide you with the things you need to know in the AWS security world, so you can get back to doing your actual job. Thank you for listening to the AWS Morning Brief: Security Edition with the latest in AWS security that actually matters. Please follow AWS Morning Brief on Apple Podcasts, Spotify, Overcast—or wherever the hell it is you find the dulcet tones of my voice—and be sure to sign up for the Last Week in AWS newsletter at lastweekinaws.com.

Announcer: This has been a HumblePod production. Stay humble.

Thu, 30 Sep 2021 03:00:00 -0700
The Actual Next 1 Million Cloud Customers

Want to give your ears a break and read this as an article? You’re looking for this link. https://www.lastweekinaws.com/blog/The-Actual-Next-1-Million-Cloud-Customers



Never miss an episode



Help the show



What's Corey up to?

Wed, 29 Sep 2021 03:00:00 -0700
Old Zealand's Data Center Migration
AWS Morning Brief for the week of September 27, 2021 with Corey Quinn.
Mon, 27 Sep 2021 03:00:00 -0700
OMIGOD, Get it Together Already

Links:


Transcript

Corey: This is the AWS Morning Brief: Security Edition. AWS is fond of saying security is job zero. That means it’s nobody in particular’s job, which means it falls to the rest of us. Just the news you need to know, none of the fluff.

Corey: This episode is sponsored in part by Thinkst Canary. This might take a little bit to explain, so bear with me. I linked against an early version of their tool, canarytokens.org, in the very early days of my newsletter, and what it does is relatively simple and straightforward. It winds up embedding credentials, files, or anything else like that that you can generate in various parts of your environment, wherever you want them to live. It gives you fake AWS API credentials, for example, and the only thing that these things do is alert you whenever someone attempts to use them. It’s an awesome approach to detecting breaches. I’ve used something similar for years myself before I found them. Check them out. But wait, there’s more because they also have an enterprise option that you should be very much aware of: canary.tools. You can take a look at this, but what it does is it provides an enterprise approach to drive these things throughout your entire environment and manage them centrally. You can get a physical device that hangs out on your network and impersonates whatever you want to. When it gets Nmap scanned, or someone attempts to log into it, or access files that it presents on a fake file store, you get instant alerts. It’s awesome. If you don’t do something like this, instead you’re likely to find out that you’ve gotten breached the very hard way. So, check it out. It’s one of those few things that I look at and say, “Wow, that is an amazing idea. I am so glad I found them. I love it.” Again, those URLs are canarytokens.org and canary.tools. And the first one is free because of course it is. The second one is enterprise-y. You’ll know which one of those you fall into. Take a look. I’m a big fan. More to come from Thinkst Canary in the weeks ahead.

Corey: Oh, for th—this is the third episode of the Last Week in AWS slash AMB: Security Edition, and instead of buying a sponsorship like a reasonable company, Microsoft Azure is once again forcing me to talk about their cloud instead, via completely blowing it when it comes to security. Again. Not only did they silently install an agent onto virtual machines in Azure that adds a handful of trivially exploitable vulnerabilities, it’s also apparently your job to fix it for them. I have to confess, I take Azure a lot less seriously than I did a month ago.

Now, let’s dive in here. Speaking of terrible things, it’s honestly difficult for me to imagine a company screwing the pooch harder than TravisCI did this month. They had a bug that started leaking private credentials into public build logs; this is bad. They fixed it; this is good. And then only begrudgingly disclosed it in a buried release with remarkably little public messaging; this is unfathomable. At this point, if you’re using TravisCI, get the hell off of it. Mistakes happen to every vendor. The ones that try to hide their mistakes are absolutely not companies you can trust.

If you put up a slide deck and accompanying notes entitled How to Build Strong Security Guardrails in the AWS Cloud With Minimal Effort, I’m probably going to take a look at it because strong guardrails are important and minimal effort is critical if you expect it to actually get done. If you’re also my longtime friend Mark Nunnikhoven, then I’m going to default to treating it as gospel because Mark frankly does not miss when it comes to AWS concepts explained in an easily approachable way. Security has got to be aligned with the way engineers work within your environment. Remember, it’s not that hard to spin up a new AWS account on someone’s corporate credit card; you absolutely do not want to incentivize that behavior.

Corey: I periodically say the OWASP Top 10, which is a list of the most critical security risks for applications on the web, has not meaningfully changed in ten years. Well, apparently it just did. It’s worth reviewing the changes; broken access control now tops the list. The Open Web Application Security Project—OWASP—is a foundation that’s remained surprisingly free of capture by security vendors. It’s a good starting point to frame your risk exposure and what to think about.

AWS VP and Distinguished Engineer Colm MacCárthaigh has an article on AWS’s new signing protocol, along with the differences between AWS SigV4 and SigV4A. As a quick primer, all requests to AWS are signed for authentication reasons. The new SigV4A isn’t region-locked (the recent release of S3 Multi-Region Access Points is why that matters), there’s no key exchange, and it’s more computationally expensive. You don’t really need to know the details as a practitioner, but you should be aware that AWS very much does put stupendous thought into this, and they sweat the details something fierce. This is why we trust cloud providers like AWS, and Google Cloud, and absolutely not Azure.
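For the curious, the SigV4 signing-key derivation is public and fits in a few lines. This sketch follows AWS’s documented scheme: a chain of HMACs that scopes a key to one date, one region, and one service, which is precisely the region lock that SigV4A’s asymmetric signatures do away with.

import hashlib
import hmac

def _hmac(key, msg):
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()

def derive_sigv4_signing_key(secret_key, date, region, service):
    k_date = _hmac(("AWS4" + secret_key).encode("utf-8"), date)  # "20211004"
    k_region = _hmac(k_date, region)      # "us-east-1": the region lock
    k_service = _hmac(k_region, service)  # "s3"
    return _hmac(k_service, "aws4_request")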

Figma has a great post up, talking about how they stopped using SSH via bastion host and started using Systems Manager Session Manager instead. Bad name, wonderful service. More to the point, what I like about this post isn’t just the, “Here’s how the technology works,” parts, but also dives into the nuts and bolts of how they handled the migration without stopping work for folks. Communicating changes like this is tricky; don’t lose sight of that.

Now, from the mouth of AWS horse itself, let’s dive in. AWS Firewall Manager now supports AWS WAF rate-based rules. This is pretty awesome if for no other reason than it’s aware both of multiple regions as well as multiple accounts.

An awful lot of security services, first- and third-party alike, tend to address only one of those at best. Anything that lets you manage things centrally in a holistic way when it comes to security is generally going to be a win, but you also don’t want a giant single point of failure. It’s a bit of a balancing act, but that’s why our field needs us. It’s why they pay us.

How to automate incident response to security events with AWS Systems Manager Incident Manager. And I’m genuinely torn on this. I like automation, but it strikes me as a way to end up automating the responses to fairly common things rather than addressing the actual cause so you get fewer false alarms. You really don’t want the security pager going off frequently, if for no other reason than you’ll be training the people carrying it to ignore it.

Announcer: Have you implemented industry best practices for securely accessing SSH servers, databases, or Kubernetes? It takes time and expertise to set up. Teleport makes it easy. It is an identity-aware access proxy that brings automatically expiring credentials for everything you need, including role-based access controls, access requests, and the audit log. It helps prevent data exfiltration and helps implement PCI and FedRAMP compliance. And best of all, Teleport is open-source and a pleasure to use. Download Teleport at goteleport.com. That’s goteleport.com.

Corey: AWS is harping about its new Standard Contractual Clauses, now part of the AWS GDPR Data Processing Addendum for customers, blah, blah, blah—look, if you have compliance obligations, here’s what you do. Check the documents in AWS Artifact, reach out to your account manager for additional resources, and whatever you do, do not attempt to YOLO it yourself from first principles. AWS has piles and piles of documents ready and waiting to satisfy regulators and auditors alike. I tried to do it myself once, and a financial institution attempted to set up a tour of us-east-1. Trust me when I say you don’t want to go down that path.



Protect your remote workforce by using a managed DNS firewall and network firewall. Look, the post can safely be discarded; it’s chock full of complexity lurking deep in the weeds, but I bring it up instead so that you think for a moment about the threat model of a remote workforce (read: most of them these days). Does having a DNS firewall protect against threats that they’re likely to encounter? Does a network firewall make sense in a zero-trust world? Consider those things in the context of your environment rather than in the context of a company that has things it needs to sell you. Good decisions are rarely sourced from vendors.

A couple of tools as well. Automating response and remediation is one of those delicate balances. The unimaginatively named AWS Security Hub Automated Response and Remediation GitHub repo has ways to handle this but it’s going to be super easy to automate away things that really shouldn’t be automated. You are definitely going to want to think through edge and corner cases.

And lastly, I tripped over checkov last week. It analyzes your Terraform slash CloudFormation slash whatever configurations for various misconfigurations. It caught a couple of things that I’ve been ignoring for a while, and while it missed another couple of problems in my environment, it’s definitely going to be something I integrate into my deployment pipelines in the future, once I have deployment pipelines. That’s checkov—C-H-E-C-K-O-V—an open-source project. Take a look. I’m a fan.



And that’s what happened to the world of AWS security last week. Enjoy not having to care about the rest of it.



Corey: I have been your host, Corey Quinn, and if you remember nothing else, it’s that when you don’t get what you want, you get experience instead. Let my experience guide you with the things you need to know in the AWS security world, so you can get back to doing your actual job. Thank you for listening to the AWS Morning Brief: Security Edition with the latest in AWS security that actually matters. Please follow AWS Morning Brief on Apple Podcasts, Spotify, Overcast—or wherever the hell it is you find the dulcet tones of my voice—and be sure to sign up for the Last Week in AWS newsletter at lastweekinaws.com.

Announcer: This has been a HumblePod production. Stay humble.

Thu, 23 Sep 2021 03:00:00 -0700
17 More Ways to Run Containers on AWS

Want to give your ears a break and read this as an article? You’re looking for this link. https://www.lastweekinaws.com/blog/17-more-ways-to-tun-containers-on-aws



Never miss an episode



Help the show



What's Corey up to?

Wed, 22 Sep 2021 03:00:00 -0700
Billed on AWS For Startups
AWS Morning Brief for the week of September 20, 2021 with Corey Quinn.
Mon, 20 Sep 2021 03:00:00 -0700
I Azure You This Shall Pass

Links:


Transcript

Corey: This is the AWS Morning Brief: Security Edition. AWS is fond of saying security is job zero. That means it’s nobody in particular’s job, which means it falls to the rest of us. Just the news you need to know, none of the fluff.

Corey: This episode is sponsored in part by Thinkst Canary. This might take a little bit to explain, so bear with me. I linked against an early version of their tool, canarytokens.org, in the very early days of my newsletter, and what it does is relatively simple and straightforward. It winds up embedding credentials, files, or anything else like that that you can generate in various parts of your environment, wherever you want them to live; it gives you fake AWS API credentials, for example. And the only thing that these are empowered to do is alert you whenever someone attempts to use them. It’s an awesome approach to detecting breaches. I’ve used something similar for years myself before I found them. Check them out. But wait, there’s more because they also have an enterprise option that you should be very much aware of: canary.tools. Take a look at this: what it does is it provides an enterprise approach to drive these things throughout your entire environment and manage them centrally. You can even get a physical device that hangs out on your network and impersonates whatever you want to. When it gets Nmap scanned, or someone attempts to log into it, or access files that it presents on a fake file store, you get instant alerts. It’s awesome. If you don’t do something like this, instead you’re likely to find out that you’ve gotten breached the very hard way. So, check it out. It’s one of those few things that I look at and say, “Wow, that is an amazing idea. I am so glad I found them. I love it.” Again, those URLs are canarytokens.org and canary.tools. And the first one is free because of course it is. The second one is enterprise-y. You’ll know which one of those you fall into. Take a look. I’m a big fan. More to come from Thinkst Canary in the weeks ahead.



Corey: Ben Kehoe, cloud robotics research scientist at iRobot—motto: “All IoT sucks, but ours is supposed to”—walks us through Principals in AWS IAM. It’s short, it’s concise, and it’s definitely worth taking the time to dig into what he has to say. If you only hunt down one thing from this podcast this week, this is the one.

Version three of OpenSSL was released, so expect a few conversations around that. There’s also apparently a rustls, which is ostensibly OpenSSL rewritten in Rust for the modern era but is in practice just another talking point for the Rust evangelism strikeforce, which is actively encouraged not to find a way to leave a comment on this episode.

Sneak or Snack or Synack—however they’re pronounced—raised a big funding round last week and still stubbornly refuses to buy a vowel. More interestingly, they report that 50% of security jobs are unfilled. Further, any solution predicated on devs becoming security experts is doomed, which is exactly the point of this podcast: what you need to know about cloud security, minus the fluff and gatekeeping. Okay fine, yes, and some snark added to keep it engaging because my God, is it dull without that.

Another week, another Azure security failure. This time a flaw existed that could leak data between users of Azure Container Services. Look, this whole thing is about AWS, so why do I talk about Azure issues like this? Simply put, people are going to bring it up in a “cloud isn’t secure” context, and you should be aware of what they’re talking about when they do. Azure, please get it together. Stuff like this hurts all cloud providers.

Corey: Troy Hunt has a post informing you that despite what your AWS bill may have you believe in the moment, self-immolation is unnecessary. Okay, that’s not actually his point, but specifically, You Don’t Need to Burn off Your Fingertips (and Other Biometric Authentication Myths) doesn’t hit quite the same way. It’s a super handy reminder that for most of you folks, adversaries are not going to steal your fingerprints to get into your systems. They’re either going to bribe you or hit you with a wrench until you tell them your password.

From the mouth of AWS horse—or from the horse’s AWS—Amazon Detective offers Splunk integration. Amazon Detective and the Case of the Missing Mountain of Money is apparently this month’s hot comic book.

And AWS—motto: “Opinions my own”—has a [security checklist 00:03:19], and it’s worth taking a look at because a few of the items they issue from time to time, like “use multiple AWS accounts,” directly contravene older guidance. It’s always good to check in on the best practices AWS is putting out there because even if you don’t make changes to your systems as a result, you should know where AWS’s head is at with respect to where the future of the industry is going.

And lastly, there was an interesting tool that came out called IAM Vulnerable. It’s an IAM privilege escalation playground that lets you muck around with exploiting improperly set IAM policies. It’s a good way to kill an hour on an afternoon when you’re not particularly motivated to do other things. Another good ‘I need a distraction’ task is rotating reused or weak passwords that you have in your password manager. And that’s what happened.
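
If you want to see the shape of the problem without standing up the whole playground, here is a minimal sketch, assuming Python with boto3 and a deliberately tiny, illustrative list of risky actions; IAM Vulnerable itself exercises far more escalation paths than this.

# A minimal sketch: flag customer-managed policies that grant classic
# privilege-escalation actions. The action list is illustrative, not
# exhaustive.
import boto3

RISKY_ACTIONS = {
    "iam:createpolicyversion",  # rewrite a policy already attached to you
    "iam:attachuserpolicy",     # attach AdministratorAccess to yourself
    "iam:putuserpolicy",        # the inline-policy variant of the same trick
    "iam:passrole",             # hand a privileged role to a service you control
}

iam = boto3.client("iam")

for page in iam.get_paginator("list_policies").paginate(Scope="Local"):
    for policy in page["Policies"]:
        document = iam.get_policy_version(
            PolicyArn=policy["Arn"],
            VersionId=policy["DefaultVersionId"],
        )["PolicyVersion"]["Document"]
        statements = document["Statement"]
        if isinstance(statements, dict):  # single-statement policies
            statements = [statements]
        for statement in statements:
            if statement.get("Effect") != "Allow":
                continue
            actions = statement.get("Action", [])
            if isinstance(actions, str):
                actions = [actions]
            hits = {a for a in actions if a.lower() in RISKY_ACTIONS or a == "*"}
            if hits:
                print(f"{policy['PolicyName']}: {sorted(hits)}")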

Announcer: Have you implemented industry best practices for securely accessing SSH servers, databases, or Kubernetes? It takes time and expertise to set up. Teleport makes it easy. It is an identity-aware access proxy that brings automatically expiring credentials for everything you need, including role-based access controls, access requests, and the audit log. It helps prevent data exfiltration and helps implement PCI and FedRAMP compliance. And best of all, Teleport is open-source and a pleasure to use. Download Teleport at goteleport.com. That’s goteleport.com.

Corey: I have been your host, Corey Quinn, and if you remember nothing else, it’s that when you don’t get what you want, you get experience instead. Let my experience guide you with the things you need to know in the AWS security world, so you can get back to doing your actual job. Thank you for listening to the AWS Morning Brief: Security Edition with the latest in AWS security that actually matters. Please follow AWS Morning Brief on Apple Podcasts, Spotify, Overcast—or wherever the hell it is you find the dulcet tones of my voice—and be sure to sign up for the Last Week in AWS newsletter at lastweekinaws.com.

Announcer: This has been a HumblePod production. Stay humble.

Thu, 16 Sep 2021 03:00:00 -0700
Why Your AWS Bill is Likely a Product of 2 Pizza Teams

Want to give your ears a break and read this as an article? You’re looking for this link. https://www.lastweekinaws.com/blog/awss-per-service-margins/



Never miss an episode



Help the show



What's Corey up to?

Wed, 15 Sep 2021 03:00:00 -0700
Amazon EKS AnyVMware
AWS Morning Brief for the week of September 13, 2021 with Corey Quinn.
Mon, 13 Sep 2021 03:00:00 -0700
Welcome to AMB: Security Edition

Links:


Transcript

Corey: This is the AWS Morning Brief: Security Edition. AWS is fond of saying security is job zero. That means it’s nobody in particular’s job, which means it falls to the rest of us. Just the news you need to know, none of the fluff.

Corey: This episode is sponsored in part by Thinkst Canary. This might take a little bit to explain, so bear with me. I linked against an early version of their tool, canarytokens.org, in the very early days of my newsletter, and what it does is relatively simple and straightforward. It winds up embedding credentials, files, or anything else like that that you can generate in various parts of your environment, wherever you want them to live; it gives you fake AWS API credentials, for example. And the only thing that these are empowered to do is alert you whenever someone attempts to use them. It’s an awesome approach to detecting breaches. I’ve used something similar for years myself before I found them. Check them out. But wait, there’s more because they also have an enterprise option that you should be very much aware of: canary.tools. Take a look at this: what it does is it provides an enterprise approach to drive these things throughout your entire environment and manage them centrally. You can even get a physical device that hangs out on your network and impersonates whatever you want to. When it gets Nmap scanned, or someone attempts to log into it, or access files that it presents on a fake file store, you get instant alerts. It’s awesome. If you don’t do something like this, instead you’re likely to find out that you’ve gotten breached the very hard way. So, check it out. It’s one of those few things that I look at and say, “Wow, that is an amazing idea. I am so glad I found them. I love it.” Again, those URLs are canarytokens.org and canary.tools. And the first one is free because of course it is. The second one is enterprise-y. You’ll know which one of those you fall into. Take a look. I’m a big fan. More to come from Thinkst Canary in the weeks ahead.



Corey: This is the inaugural episode of what is going to become a weekly feature, the AWS Morning Brief: Security Edition, where I do what I normally do: round up the news from Amazon’s cloud ecosystem, pick the things that I find interesting, and make fun of them—only in the security world. This is going to be the things the rest of us need to care about, not the things AWS feels compelled to put out there but no one in the trenches tends to read. If you don’t work in security—by which I mean you don’t have the word “security” in your job title—you’re in the right place. Neither do I, but I still have to care. So, what happened last week? Well, let’s dive in and we’ll see how this show shapes up.

We begin with the fact that there’s a contingent of anti-cloud folks out there who make the argument that [the cloud is somehow insecure, unsafe for your data, and not something you should be doing 00:08:26]. I generally have little patience for those folks, but when Azure’s Cosmos DB had a bug that allowed third parties unfettered and unlogged access to customer data, I’m hard-pressed to disagree with them. Events like this aren’t good for anyone. Companies don’t say things like, “Wow, Azure security seems dicey; I’m going to use AWS or Google Cloud instead.” They say things instead like, “Can’t trust the cloud. Hey, Dewey, fire up your Motel 6 loyalty card because you’re about to spend the next nine months on the road building more company data centers for us.” Events like this weaken us all.



The second volume of the Lacework Cloud Threat Report has been released, and one of the things I really appreciate about it is that it talks about what’s actually going on in the wild, not invented theoretical threats that are designed to get you to shovel money into their product. I do not and will not condone the fear, uncertainty, and doubt—or FUD—marketing approach. There’s a reason that The Duckbill Group’s web pages are about how we help, not stuffed full of dire warnings about what might go wrong and blow the budget. If I can do it, so can the entire security industry. Nice job, Lacework, on that one.

There was a [great screed on Twitter 00:08:26] last week on the perils of using AWS read-only managed policies. The gist of the argument is that AWS is always updating these things, and permissions that aren’t included today may well be included tomorrow. Further, AWS does indeed have over-scoped permissions in managed policies; I gave a talk about one of them at re:Invent 2019. It’s a good thing to be aware of: while managed policies are definitely convenient, AWS places the policies you choose to attach squarely on the customer side of the shared responsibility model. Well, when they screw theirs up, that’s what they claim, anyway.
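
If you want a cheap way to notice when AWS quietly updates one of these, a small snapshot script does the trick. A sketch, assuming Python with boto3; ReadOnlyAccess and the output filename are stand-ins, not a recommendation.

# Dump the current default version of a managed policy to a file, and
# diff it whenever AWS ships an update.
import json
import boto3

iam = boto3.client("iam")
ARN = "arn:aws:iam::aws:policy/ReadOnlyAccess"

policy = iam.get_policy(PolicyArn=ARN)["Policy"]
version = iam.get_policy_version(
    PolicyArn=ARN,
    VersionId=policy["DefaultVersionId"],
)["PolicyVersion"]

snapshot = {
    "arn": ARN,
    "version": policy["DefaultVersionId"],
    "updated": version["CreateDate"].isoformat(),
    "document": version["Document"],
}

with open("readonlyaccess-snapshot.json", "w") as f:
    json.dump(snapshot, f, indent=2, sort_keys=True)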



Luc van Donkersgoed recently found an enumeration vulnerability in AWS that allows users to determine valid account IDs and the IAM principals in them. AWS insists that this information is not sensitive and thus this doesn’t constitute a vulnerability. I can see that viewpoint, but if it’s true, why do AWS blog post screenshots always blur the account ID? Why isn’t there an API to explicitly get the account ID for a given resource?

The AWS documentation on account identifiers states that you shouldn’t provide credentials to third parties; it doesn’t say anything about account IDs. The messaging is, at a minimum, confusing. Until AWS clarifies it, treat your AWS account ID as sensitive, I guess. There’s not a lot of reason for third parties to need it. I just wish AWS would stop being misunderstood for long periods of time on this particular point.
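
And if “treat it as sensitive” is to mean anything in practice, scrub account IDs before logs leave your control. A crude sketch in Python; the bare 12-digit regex is an assumption and will happily over-match other 12-digit numbers.

# Scrub anything shaped like a 12-digit AWS account ID from a log line.
import re

ACCOUNT_ID = re.compile(r"\b\d{12}\b")

def redact(line: str) -> str:
    return ACCOUNT_ID.sub("************", line)

print(redact("arn:aws:iam::123456789012:role/deploy"))
# -> arn:aws:iam::************:role/deploy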

Announcer: Have you implemented industry best practices for securely accessing SSH servers, databases, or Kubernetes? It takes time and expertise to set up. Teleport makes it easy. It is an identity-aware access proxy that brings automatically expiring credentials for everything you need, including role-based access controls, access requests, and the audit log. It helps prevent data exfiltration and helps implement PCI and FedRAMP compliance. And best of all, Teleport is open-source and a pleasure to use. Download Teleport at goteleport.com. That’s goteleport.com.

Corey: [Imperva has a post 00:08:26] that, while extolling the virtues of paying them money, also alludes to the fact that a botnet attack that can hurl stupendous volumes of traffic is available for something like five bucks an hour. First, it turns out that revenge against things like the Managed NAT Gateway pricing page costs way less money than I thought. Secondly, and more relevant to you folks than to me: have a plan [laugh] for what happens when some trash goblin decides that your company has displeased them and hurls a bunch of garbage traffic your way. Do a quick exploration of the various options in this space—none of which I have recent enough experience with to endorse—and have a plan before you get a phone call from your boss screaming that the website is down. Fix it, fix it, fix it, now.

If you work at Facebook, this entire section doesn’t apply to you, since when your site is down, the internet is clearly better for it. There was a guide to High Availability WireGuard on AWS, which was useful—and I’m not saying that from the perspective of explicitly running WireGuard per se, but more in terms of thinking through single points of failure in things like the network, which almost always stays up because the cloud is pretty good at things. Treat this guide less as a WireGuard tutorial and more as a primer on how to think about your network risk exposure, because I assure you there are security implications there.

Now, what did AWS have to say on their blog? This is that time of the podcast. How to improve visibility into AWS WAF with anomaly detection, and the honest answer is to pay a partner.

Look, I’m no happier about needing to drag third parties in to perform basic tasks on a potentially expensive AWS service than you are, but bolting together the monstrosity that AWS talks about in this post is not going to win you any friends. The biggest problem with a lot of these ‘build it from popsicle sticks’ solutions is that they’re complicated. Complexity breeds insecurity, simply because you won’t understand the various nuances that go into all the different parts, and that leads to security lapses.

How US federal agencies can authenticate to AWS with multi-factor authentication. Because it’s federal, you undoubtedly have to use a government-grade MFA device. They no doubt weigh 50 pounds, cost $40,000 a pop, and take 20 minutes to boot up before they can be used.

Ransomware mitigation: Top 5 protections and recovery preparation actions. There’s good advice in this article. There are also cross-sells to other AWS services in this article. And this is my entire problem with the way these articles are structured: the actually good advice gets dismissed as a sales pitch.

And finally, Top 10 security best practices for securing data in Amazon S3. Some sales pitches, some good tips, and of course, encrypt your data in S3 without ever explaining why to do such a thing. Are people stealing discs out of AWS data centers? No? Okay, so that’s off the table as a threat model.

What precisely does encrypting data at rest buy you? That said, it’s not a hill worth dying on. Check the box, appease your auditor, and get on with doing the things that are important in your environment. And that’s the point of this podcast: you’re not going to win those arguments, but you’ll spend a lot of time trying. I’m here to make your job easier.
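
For the box-checkers, the whole exercise is roughly one API call. A sketch, assuming Python with boto3 and a stand-in bucket name.

# Checking the box: turn on default encryption (SSE-S3) for a bucket.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_encryption(
    Bucket="my-example-bucket",  # stand-in name
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "AES256"  # SSE-S3; use aws:kms for KMS-managed keys
                }
            }
        ]
    },
)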

That is all the stuff that you need to be aware of that happened in AWS security last week. Well, that we know about. I’m sure something horrifying has happened that we will hear about in future weeks.

Corey: I have been your host, Corey Quinn, and if you remember nothing else, it’s that when you don’t get what you want, you get experience instead. Let my experience guide you with the things you need to know in the AWS security world, so you can get back to doing your actual job. Thank you for listening to the AWS Morning Brief: Security Edition.

Announcer: This has been a HumblePod production. Stay humble.

Thu, 09 Sep 2021 03:00:00 -0700
SaaS Cost Tools Suck

Want to give your ears a break and read this as an article? You’re looking for this link. https://www.lastweekinaws.com/blog/saas-cost-tools-suck



Never miss an episode



Help the show



What's Corey up to?

Wed, 08 Sep 2021 06:44:26 -0700
Malevolent Clown Computing
AWS Morning Brief for the week of September 6, 2021 with Corey Quinn.
Mon, 06 Sep 2021 03:00:00 -0700
Hey AWS, You’re Missing Forrest for the Trees

Want to give your ears a break and read this as an article? You’re looking for this link https://www.lastweekinaws.com/blog/hey-aws-youre-missing-forrest-for-the-trees/



Never miss an episode



Help the show



What's Corey up to?

Wed, 01 Sep 2021 03:00:00 -0700
Error 500: You Suck At Computers
AWS Morning Brief for the week of August 30, 2021 with Corey Quinn.
Mon, 30 Aug 2021 03:00:00 -0700
How to Effectively Interview for Work with a Portfolio Site

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/How-to-Effectively-Interview-for-Work-with-a-Portfolio-Site



Never miss an episode



Help the show



What's Corey up to?

Thu, 26 Aug 2021 07:00:00 -0700
Forget MemoryDB
AWS Morning Brief for the week of August 23, 2021 with Corey Quinn.
Mon, 23 Aug 2021 03:00:00 -0700
A MultiCloud Rant

Transcript

Corey: This episode is sponsored in part by our friends at ChaosSearch. You could run Elasticsearch or Elastic Cloud—or OpenSearch as they’re calling it now—or a self-hosted ELK stack. But why? ChaosSearch gives you the same API you’ve come to know and tolerate, along with unlimited data retention and no data movement. Just throw your data into S3 and proceed from there as you would expect. This is great for IT operations folks, for app performance monitoring, cybersecurity. If you’re using Elasticsearch, consider not running Elasticsearch. They’re also available now in the AWS marketplace if you’d prefer not to go direct and have half of whatever you pay them count towards your EDB commitment. Discover what companies like Klarna, Equifax, Armor Security, and Blackboard already have. To learn more, visit chaossearch.io and tell them I sent you just so you can see them facepalm, yet again.

Corey: You know what really grinds my gears? Well, lots of things, but in this case, let’s talk about multi-cloud. Not my typical rant about multi-cloud not ever being a good best practice—because it’s not—but rather how companies talk about multi-cloud. HashiCorp just did a whole survey on how multi-cloud is the future, and at no point during that entire process did they define the term. So, you wind up with a whole bunch of people responding, each one talking about different things.

Are we talking about multiple clouds and we have a workload that flows between them? Are we talking about, “Well, we have some workloads on one cloud provider and a different set of workloads on other cloud providers?” Did they break it down as far as SaaS companies go of, “Yeah, we have an application and we’d like to run it all on one cloud, but it’s data-heavy and we have to put it where our customers are, so of course we’re on multiple cloud providers.” And then you wind up with the stories that other companies talk about, where you have a bunch of folks where their sole contribution to the ecosystem is, “Ah, you get a single pane of glass between different cloud providers.”

You know who wants that? No one. The only people who really care about those things are the folks who used to sell those items and realized that if this dries up and blows away, they have nothing left to sell you. There’s also a lot of cloud providers who are deep into the whole multi-cloud is the way and the light and the future because they know if you go all-in on a single cloud provider, it will certainly not be them. And then you have the folks who say, “Go in on one cloud provider and don’t worry about it. It’ll be fine. If you need to migrate down the road, you can do that.”

And I believe that’s generally the way you should approach things, but it gets really annoying and condescending when AWS tells that story because, from their perspective, yeah, just go all-in and use Dynamo as your data store for everything, even though there’s really no equivalent on other cloud providers. Or, “Yeah, go ahead and just tie all of your data warehousing to some of the more intricate and non-replicable parts of S3.” And so on and so forth. It just feels like they’re pushing a lock-in narrative in many respects. I like the idea of a strategic exodus, where if I have to move a thing down the road, I don’t have to reinvent the data model.

And a classic example of what I would avoid in that case is something like Google Spanner—or Google Cloud Spanner, or whatever the one they sell us is—because yeah, it’s great, and it’s awesome. And you wind up with, effectively, what looks like an ACID-compliant SQL database that spans globally. But there’s nothing else quite like that, so if I have to migrate off, it’s not just a matter of changing APIs, I have to re-architect my entire application to be aware of the fact that I can’t really have that architecture anymore, just from a data flow perspective. And looking at this across the board, I find that this is also a bit esoteric because generally speaking, the people who are talking the most about multi-cloud and wanting to avoid lock-in, are treating the cloud like it’s fundamentally an extension of their own crappy data center where they run a bunch of VMs and that’s it.

They say they want to be multi-cloud, but they’re only ever building for one cloud, and everything that they’re building on top of it is just reinventing baseline primitives. “Oh, we don’t trust their load balancers. We’re going to run our own with Nginx or HAProxy.” Great. While you’re doing that, your competitors are getting further ahead.

You’re not even really in the cloud: you basically did the lift part, declined to shift, and declared victory. Really, the only problem you’ve solved for is that you suck at dealing with hard drive failure—you used to deal with outages in your data center, and now your cloud provider handles that for you at a premium that’s eye-wateringly high.



Corey: I really love installing, upgrading, and fixing security agents in my cloud estate. Why do I say that? Because I sell things for a company that deploys an agent. There’s no other reason. Because let’s face it; agents can be a real headache. Well, Orca Security now gives you a single tool to detect basically every risk in your cloud environment that’s as easy to install and maintain as a smartphone app. It is agentless—or my intro would have gotten me in trouble here—but it can still see deep into your AWS workloads while guaranteeing 100% coverage. With Orca Security there are no overlooked assets, no DevOps headaches—and believe me, you will hear from those people if you cause them headaches—and no performance hits on live environment. Connect your first cloud account in minutes and see for yourself at orca dot security. That’s orca—as in whale—dot security as in that thing your company claims to care about but doesn’t until right after it really should have.

Corey: Look, I don’t mean to be sitting here saying that this is how every company operates because it’s not. But we see a lot of multi-cloud narrative out there, and what’s most obnoxious about all of it is that it’s coming from companies that are strong enough to stand on their own. And by pushing this narrative, it’s increasingly getting to a point where if you’re not in a multi-cloud environment, you start to think, “Maybe I’m doing something wrong.” You’re not. There’s no value to this.

Remember, you have a business that you’re trying to run, in theory. Or, for those of us who are still learning things: yeah, we want to learn one cloud provider before we learn all the cloud providers, let’s not kid ourselves. Pick one, go all-in on it for the time being, and don’t worry about what the rest of the industry is doing. We’re not trying to collect them all. There is no Gartner Magic Quadrant for Pokémon, and I don’t think the cloud providers should be one of them.

I know I’ve talked about this stuff before, but people keep making the same fundamental errors and it’s time for me to rant on it just a smidgen more than I have already.



Thank you for listening, as always, to Fridays From the Field on the AWS Morning Brief. And as always, I’m Chief Cloud Economist Corey Quinn, imploring you to continue to make good choices.

Announcer: This has been a HumblePod production. Stay humble.

Fri, 20 Aug 2021 03:00:00 -0700
The Next Million Cloud Customers

Want to give your ears a break and read this as an article? You’re looking for this link. https://www.lastweekinaws.com/blog/the-next-million-cloud-customers



Never miss an episode



Help the show



What's Corey up to?

Wed, 18 Aug 2021 03:00:00 -0700
There's No re:Inforce-ment Learning Without Pavlov's Charlie Bell
AWS Morning Brief for the week of August 16, 2021 with Corey Quinn.
Mon, 16 Aug 2021 03:00:00 -0700
re:Imagining AWS re:Invent

Want to give your ears a break and read this as an article? You’re looking for this link. https://www.lastweekinaws.com/blog/re:imagining-aws-re:invent



Never miss an episode



Help the show



What's Corey up to?

Wed, 11 Aug 2021 03:00:00 -0700
Accenture Web Services
AWS Morning Brief for the week of August 9 2021 with Corey Quinn.
Mon, 09 Aug 2021 03:00:00 -0700
How AWS is Still Egregiously Egressing

Links:

Transcript

Corey: This episode is sponsored in part by our friends at ChaosSearch. You could run Elasticsearch or Elastic Cloud—or OpenSearch as they’re calling it now—or a self-hosted ELK stack. But why? ChaosSearch gives you the same API you’ve come to know and tolerate, along with unlimited data retention and no data movement. Just throw your data into S3 and proceed from there as you would expect. This is great for IT operations folks, for app performance monitoring, cybersecurity. If you’re using Elasticsearch, consider not running Elasticsearch. They’re also available now in the AWS marketplace if you’d prefer not to go direct and have half of whatever you pay them count towards your EDB commitment. Discover what companies like Klarna, Equifax, Armor Security, and Blackboard already have. To learn more, visit chaossearch.io and tell them I sent you just so you can see them facepalm, yet again.

Corey: Hi there. Chief Cloud Economist Corey Quinn from The Duckbill Group here, to more or less rant for a minute about something that’s been annoying the heck out of me for a while, as anyone who follows me on Twitter, subscribes to the lastweekinaws.com newsletter, or passes me in a crowded elevator will attest: AWS’s data transfer story.



Back on July 23rd—of 2021, for those listening to this in future years—CloudFlare did a blog post titled AWS’s Egregious Egress, co-authored by Matthew Prince, CloudFlare’s CEO, and Nitin Rao, who is one of their employees. Presumably; that was somewhat unclear. It effectively tears down the obnoxious—and I mean deeply obnoxious—level of AWS data transfer pricing for egress to the outside world.

And there’s a bunch of things to unpack in this blog post, where they compare AWS pricing to the wholesale bandwidth market and go into some depth, for those who aren’t aware, on how bandwidth is generally charged for. The markups they come up with for AWS are, in many cases, almost 8,000%, which is just ludicrous in some respects because—spoiler—every year, give or take, the wholesale cost of network bandwidth drops by about 10%. And the math they’ve done, which I’m too lazy to check, says that while the wholesale market has dropped 93%, what we pay AWS hasn’t, given that they don’t tend to reduce egress bandwidth pricing, basically ever. And that’s obnoxious.
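
The compounding checks out on a napkin; the 25-year horizon below is an assumption chosen to line up with their 93% figure.

# Napkin math: a roughly 10% annual decline in wholesale transit,
# compounded over 25 years (an assumed horizon).
years = 25
remaining = 0.9 ** years                 # about 0.07 of the original price
print(f"~{1 - remaining:.0%} cheaper")   # ~93% cheaper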

They also talk—rather extensively—about how ingress is generally free. Now, there’s a whole list of reasons that could be true, but let’s face it: when bandwidth into AWS is free, you start to think of all bandwidth that way—“Oh, it’s bandwidth; how expensive could it possibly be?” But when you see data coming out and it charges you through the nose, you start to think that it’s purely predatory. So, it already starts off with customers not feeling super great about this. Then diving into it, of course, they’re pushing the whole Bandwidth Alliance that CloudFlare spun up, and good for them; that’s great.

They have a bunch of other providers willing to play games with them and partner. Cool, I get it. It’s a sales pitch. They’re trying to more or less bully Amazon into doing the right thing here, in some ways. Great, not my actual point.



My problem is that it’s not just that data transfer is expensive in AWS land, but it’s also inscrutable because, ignoring for a second what it costs to send things to the outside world, it’s more obnoxious trying to figure out what it costs to send things inside of AWS. It ranges anywhere from free to very much not free. If you have a private subnet that’s talking to something in the public subnet that needs to go through a managed NAT gateway, whatever your transfer price is going to be has four and a half cents per gigabyte added on to it with no price breaks for volume. So, it’s very easy to wind up accidentally having some horrifyingly expensive bills for these things and not being super clear as to why. It’s very challenging to look at this and not come away with the conclusion that someone at the table is the sucker.

And, as anyone who plays poker can tell you, if you can’t spot the sucker, it’s you. Further—and this is the part that I wish more people paid attention to—if I’m running an AWS managed service—maybe RDS, maybe DynamoDB, maybe ElastiCache, maybe Elasticsearch—none of these things are necessarily going to be best-of-breed for the solution I’m looking at, but their replication traffic between AZs in the same region is baked into the price; you don’t pay a per-gigabyte fee for it. If you want to run something else, you either run it yourself on top of EC2 instances or grab something a partner provides through the AWS Marketplace. There is no pattern in which that cross-AZ replication traffic is free; you pay for every gigabyte—generally two cents a gigabyte, but that can increase significantly in some places.
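
To put list-price numbers on those two charges, here is a napkin sketch in Python; the 50 TB a month workload is made up for illustration.

# List-price math on the two charges above, for a made-up workload.
tb_per_month = 50
gb = tb_per_month * 1024

nat_processing = gb * 0.045  # Managed NAT Gateway data processing, per GB
cross_az = gb * 0.02         # cross-AZ transfer, roughly two cents per GB

print(f"NAT processing: ${nat_processing:,.0f}/month")  # $2,304/month
print(f"Cross-AZ:       ${cross_az:,.0f}/month")        # $1,024/month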

Corey: I really love installing, upgrading, and fixing security agents in my cloud estate. Why do I say that? Because I sell things for a company that deploys an agent. There’s no other reason. Because let’s face it; agents can be a real headache. Well, Orca Security now gives you a single tool to detect basically every risk in your cloud environment that’s as easy to install and maintain as a smartphone app. It is agentless—or my intro would have gotten me in trouble here—but it can still see deep into your AWS workloads while guaranteeing 100% coverage. With Orca Security there are no overlooked assets, no DevOps headaches—and believe me, you will hear from those people if you cause them headaches—and no performance hits on live environment. Connect your first cloud account in minutes and see for yourself at orca dot security. That’s orca—as in whale—dot security as in that thing your company claims to care about but doesn’t until right after it really should have.

Corey: It feels predatory, it feels anti-competitive, and you look at this and you can’t shake the feeling that somehow their network group is being evaluated on how much profit it can turn, as opposed to being the connective tissue that makes all the rest of their services work. Whenever I find someone with an outsized data transfer bill while doing the deep-dive analysis on what they have in their accounts, and I talk to them about it, they come away feeling, on some level, ripped off, and they’re not wrong. Now, if you take a look at other providers—Oracle Cloud is a great example of this—their retail rate is about 10% of AWS’s for the same level of traffic. In other words, get a 90% discount without signing any contract; just sign up and go with Oracle Cloud. Look, if what you’re doing is bandwidth-centric, it’s hard to turn your nose up at that, especially if you start kicking the tires and like what you see over there.

This is the Achilles heel of what happens in the world of AWS. Now, I know I’m going to wind up getting letters about this—I always do whenever I rant about it—saying that no one at any significant scale is paying retail rate for AWS bandwidth. Right, but that’s sort of the point: when I’m sitting here doing back-of-the-envelope calculations on starting something new, and that thing tends to be fairly heavy on data transfer—like video streaming—I look at the retail published rates. It doesn’t matter what the discount would be, because I’m still trying to figure out whether the thing has any baseline level of viability, and I run the numbers and realize, wow, 95% of my AWS bill is going to be data transfer. Well, I guess my answer is not AWS. That’s not a pure hypothetical.



I was speaking to someone years ago, and they have raised many tens of millions of dollars for their company since, and it’s not on AWS because it can’t be given their public pricing. Look, this is not me trying to beat up unnecessarily on AWS. I’m beating them up on something that frankly, has been where it is for far too long and needs to be addressed. This is not customer obsession; this is not earning trust; this is not in any meaningful way aligned with where customers are and the problems customers are trying to solve. In many cases, customers are going to be better served by keeping two copies of the data, one in each availability zone rather than trying to replicate back and forth between them because that’s what the economics dictate.

That’s ludicrous. It should never be that way. But here we are. And here I am. I’m Chief Cloud Economist Corey Quinn here at the Duckbill Group. Thank you for listening to my rant about AWS data transfer pricing.

Announcer: This has been a HumblePod production. Stay humble.

Fri, 06 Aug 2021 03:00:00 -0700
The Cloud's Competing Approaches to Deprecation

Want to give your ears a break and read this as an article? You’re looking for this link. https://www.lastweekinaws.com/blog/The-Clouds-Competing-Approaches-to-Deprecation



Never miss an episode



Help the show



What's Corey up to?

Wed, 04 Aug 2021 03:00:00 -0700
EC2 Classic Shuffleboard
AWS Morning Brief for the week of August 2, 2021, with Corey Quinn.
Mon, 02 Aug 2021 03:30:00 -0700
Optimize Yourself Before You Invest Yourself

Corey: This episode is sponsored in part by our friends at ChaosSearch. You could run Elasticsearch or Elastic Cloud—or OpenSearch as they’re calling it now—or a self-hosted ELK stack. But why? ChaosSearch gives you the same API you’ve come to know and tolerate, along with unlimited data retention and no data movement. Just throw your data into S3 and proceed from there as you would expect. This is great for IT operations folks, for app performance monitoring, cybersecurity. If you’re using Elasticsearch, consider not running Elasticsearch. They’re also available now in the AWS marketplace if you’d prefer not to go direct and have half of whatever you pay them count towards your EDB commitment. Discover what companies like Klarna, Equifax, Armor Security, and Blackboard already have. To learn more, visit chaossearch.io and tell them I sent you just so you can see them facepalm, yet again.

Jesse: Hello, and welcome to AWS Morning Brief: Fridays From the Field. I’m Jesse DeRose.

Amy: I’m Amy Negrette.

Tim: And I’m Tim Banks.

Jesse: This is the podcast within a podcast where we talk about all the ways that we’ve seen AWS used and abused in the wild. Today, we’re going to be talking about the relationship between cost optimization work and investing in reservations or private pricing with AWS. This is kind of a situational conversation. Let’s say you’ve got three months left on your EDP, or maybe your spend is reaching the point where you’re starting to think about investing in or signing an EDP. But you’ve also got some cost optimization opportunities that you want to work on. How do you prioritize those two ideas?

Tim: I think when we’re talking about this, first it’s important to talk about what goes into an EDP—like, what it is and what it involves. So, EDP for AWS is the Enterprise Discount Program, and what it involves is you making a monetary commitment to AWS to spend a certain amount over a certain amount of time. So, in a three-year EDP, you’re going to spend X amount in one year, X amount the next year, and X amount the third year for a total of whatever you decide on. AWS is typically going to want 20% year-over-year growth, so you’re going to say you’ll spend a million dollars, and then a million dollars plus 20% is $1.2 million; then 20% on top of that, and so forth and so on.

And so your total commit will be somewhere around, like, $3.6, $3.7 million, we’ll say, right? Once you’ve signed the EDP, that’s how much you’re going to get billed for, minimum. So, it’s important to cost-optimize before you make that commitment, because if AWS is expecting you—and you’re on the hook—to make 20% year-over-year growth, but then you optimize and save 20% of your bill, it won’t matter: you’re still going to owe AWS the same amount of money even if you cost-optimize.
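
Tim’s numbers, worked out; the $1 million starting commit is his example figure.

# A three-year EDP: $1M in year one, growing 20% year over year.
commit_year_one = 1_000_000
years = [commit_year_one * 1.2 ** n for n in range(3)]  # $1M, $1.2M, $1.44M
print(f"total commit: ${sum(years):,.0f}")              # $3,640,000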



Jesse: Yeah, I want to take a step back and talk about the EDP—as we mentioned, the Enterprise Discount Program—because it comes in a couple of other flavors that give you a variety of different types of discounts. The EDP generally focuses on a cross-service discount for a certain annual commit, but there are also private pricing agreements, private pricing addendums, and other private pricing, generally speaking, offered by AWS. All of those basically expect some amount of either spend or usage on a yearly basis in exchange for discounts on that usage. And really, that is something that, broadly speaking, we do recommend you focus on—we do recommend that you invest in those reservations—but it’s important to think about that, I agree, after the cost optimization work.

Amy: The thing is that AWS also provides commitment-based discounts that you don’t need an EDP for, namely reservations and savings plans. So, you would similarly be on the hook if you decide, “I have this much traffic, and I want a savings plan or a reservation for it,” and then suddenly you don’t have that traffic anymore—you still have to make up that commitment.

Tim: I’ll say, I think, too, that this also matters when you’re looking at things like reservations. If you’re going to reserve instances, you want a solid idea of how many you’re actually going to need, so that you’re not reserving too many—otherwise you optimize, you downsize, and all of a sudden you have all these reservations that you’re not going to use.

Jesse: One thing to also call out: when renewing an EDP or private pricing, or when entering into a new agreement for any kind of private pricing with AWS, they will generally look at the last six months of your usage—either broadly speaking if it’s an EDP, or within a specific AWS service if it’s private pricing for that service—and they will basically double that six months of spend to project the annual commit they expect going forward. So, if you spent a lot of money over the last six months, they’re going to expect that trend to continue. If you enter into an agreement at that projected twelve-month spend and then make cost optimization changes, you’re ultimately going to be on the hook for a higher level of spend that you’re not actually spending anymore. Whereas if you focus on the cost optimization work first, it gives you the opportunity to approach AWS with a lower commit level—which may mean a lower tier of percentage discount, but then you’re not on the hook for spend you wouldn’t otherwise be spending.
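
A toy version of that ordering argument, with a made-up 20% optimization win.

# AWS projects your commit from the trailing six months, roughly doubled.
six_month_spend = 600_000
commit_if_you_sign_now = six_month_spend * 2               # $1,200,000
commit_if_you_optimize_first = six_month_spend * 0.8 * 2   # $960,000 (hypothetical 20% cut)
print(commit_if_you_sign_now, commit_if_you_optimize_first)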



Tim: I think one of the main things people see, too, when they’ve looked at, like, “Oh, what’s the low-hanging fruit for me to lower the cost?” They’ll think, “Oh, well, I can do an EDP,” because AWS is going to want you to sign on; they would love to have that guaranteed money, right? And a lot of times, that’s going to be a much easier thing to do, organizationally, than the work of cost optimization, because almost always that involves engineering hours, it involves planning, it involves changes that are probably going to be harder to make than just signing a contract. But again, it’s super necessary, because you really need to go in with eyes open and figure out what you’re going to commit to, whether it’s a private pricing agreement, an EDP, or reservations. Decide what you want to do and what it should look like, get as optimized and as lean as you can, then make your commitments. And then, once you have an EDP, that’s when you want to make your reserved instance or savings plan purchases and things like that, so you do those with a discount across them.

Jesse: Yeah, that’s another important thing to point out: focus on the cost optimization work first. Get your architecture and your workloads as optimized as possible—or as optimized as you can within the given timeframe—then focus on the investment, because then you’ll have a much better idea of what your growth is going to look like year-over-year for an EDP or any kind of private pricing. And then, after that, purchase any reservations, like reserved instances or savings plans, because then you get not only the discount from the EDP you just signed, but any upfront or partial-upfront payments you make for those reservations apply toward your first-year EDP commit. So, not only are you getting a discount on those, you’re also able to put money toward that first-year commit; you’re essentially giving yourself a little more wiggle room by purchasing reservations after you’ve signed an EDP.

Tim: And another way to game that system is if you know that you’re going to be undertaking some projects, especially that you want to get discounts around, and you’re going to need to utilize software or service or anything like that involves an AWS partner on the AWS marketplace, you’re going to want to do that after you sign your EDP, too, because even though you may not get a discount on it, that money will still count towards your commit.

Corey: I really love installing, upgrading, and fixing security agents in my cloud estate. Why do I say that? Because I sell things for a company that deploys an agent. There’s no other reason. Because let’s face it; agents can be a real headache. Well, Orca Security now gives you a single tool to detect basically every risk in your cloud environment that’s as easy to install and maintain as a smartphone app. It is agentless—or my intro would have gotten me in trouble here—but it can still see deep into your AWS workloads while guaranteeing 100% coverage. With Orca Security there are no overlooked assets, no DevOps headaches—and believe me, you will hear from those people if you cause them headaches—and no performance hits on live environment. Connect your first cloud account in minutes and see for yourself at orca dot security. That’s orca—as in whale—dot security as in that thing your company claims to care about but doesn’t until right after it really should have.

Tim: It is important to talk about the future goals for your company, from a financial perspective, both at an architectural level but also at a strategic level, so you can make good quality decisions. And, you know, to toot our own horn, that’s a lot of where our expertise comes in, where we can say, “These are the order you’re going to do these things in, and these are what you should prioritize.” I mean, everyone knows that in the end, the net result should still be the same. You’re going to have to do the engineering and architecture work to optimize; you’re going to have to do the administrative stuff to sign these agreements to get discounts, but you need to know what to prioritize and what’s going to be most important, and sometimes you don’t have the insight on that. And that’s where if you don’t, get someone in there to help you figure out what’s what, what’s going to give you the best, most bang for your buck, but also what’s going to make the most sense for you going forward, six months, a year, two years, three years, and so forth and so on. So, it is okay to not know these things. Nobody’s an expert on everything, but it behooves you to rely on the people who are experts when it’s a blind spot for you.

Jesse: I think that’s a really good point that you make, Tim. What we see in a number of organizations we work with is essentially two disconnects: one between the folks doing the work day-to-day and the folks purchasing reservations, and another between the folks purchasing reservations and the folks negotiating an Enterprise Discount Program or private pricing. And to Tim’s point, it’s really important to get all of those people in a conversation together—get everybody in a room together, so to speak—so that finance, engineering, product, and leadership all understand together that the cost optimization work is going on, that reservations are being purchased, and that there’s a conversation happening about investing in some kind of private pricing with AWS. That way, collectively, everybody can make a data-driven decision together that ultimately helps everybody win and accomplish their goals.

Amy: Speaking of collaboration, we often talk about having a good relationship with your AWS account manager, and this is one of those places where having a good rapport really works in your favor. If you’re in regular communication with your account manager, you know each other well, you have a good working relationship, and they’re good at their job, then they’ll know that you’re using service XYZ at high volume, and they’ll be able to tell you, “Hey, you hit a threshold; let’s see if we can get you some extra discounts.” They’re the ones who actually know what those discount programs are and can facilitate them.

Jesse: All right, well, that will do it for us this week, folks. If you’ve got questions you’d like us to answer please go to lastweekinaws.com/QA; fill out the form and we’ll answer those questions on a future episode of the show. If you’ve enjoyed this podcast, please go to lastweekinaws.com/review and give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review, give it a five-star rating on your podcast platform of choice and tell us how you would cost-optimize your organization.

Announcer: This has been a HumblePod production. Stay humble.

Fri, 30 Jul 2021 03:00:00 -0700
The Amazonian Evil Infecting AWS

Want to give your ears a break and read this as an article? You’re looking for this link. https://www.lastweekinaws.com/blog/the-amazonian-evil-infecting-aws



Never miss an episode



Help the show



What's Corey up to?

Wed, 28 Jul 2021 03:00:00 -0700
Prix Fixe IP Prefixes
AWS Morning Brief for the week of July 26, 2021 with Corey Quinn.
Mon, 26 Jul 2021 03:00:00 -0700
AWS Isn’t a Threat to OSS

Transcript


Corey: This episode is sponsored in part by LaunchDarkly. Take a look at what it takes to get your code into production. I’m going to just guess that it’s awful because it’s always awful. No one loves their deployment process. What if launching new features didn’t require you to do a full-on code and possibly infrastructure deploy? What if you could test on a small subset of users and then roll it back immediately if results aren’t what you expect? LaunchDarkly does exactly this. To learn more, visit launchdarkly.com and tell them Corey sent you, and watch for the wince.


Jesse: Hello, and welcome to AWS Morning Brief: Fridays From the Field. I’m Jesse DeRose.


Amy: I’m Amy Negrette.


Tim: And I’m Tim Banks.


Jesse: This is the podcast within a podcast where we talk about all the ways we’ve seen AWS used and abused in the wild. Today, we’re going to be talking about AWS, an open-source software. Now, that’s kind of a broad topic, but there have been some specific, recent events I’ll say, over the last year maybe or maybe even less, related to AWS and open-source software that really got us talking, and I wanted to have a deeper conversation with both of you on this topic.


Tim: Well, you should probably start by going over some of the things that you’re mentioning, when you say ‘some of these things,’ what are those things, Jesse?


Jesse: Yeah. So, I think the best place to start is what constitutes open-source software. And specifically, I think, not just what constitutes open-source software, but how does that differ from an open-source company?


Tim: So, open-source software can be anything: the Linux kernel, bash, any Python function or module, anything like that. If you make a piece of software, whatever it is, and you license it with one of the various open-source licenses, or your own open-source license or whatever, it’s something that the community kind of owns. When projects get big, they have maintainers and everything like that, but at its essence, it’s a piece of software that you can freely download and use, and that you’re free to modify as you need; then it’s up to the specifics of the license whether you’re required to send those modifications back, to include them, or whatever. But the essence is that it’s a piece of software that’s free for me to use and free for me to modify under its license.


Jesse: And one of the other things I want to add to that—correct me if I’m wrong here—is that a lot of open-source software is very community-owned: there’s a lot of focus on folks from the community who use the software giving back, not because they need to under the licensing, necessarily, but because they want to keep using it and keep making it better over time.


Amy: I think one of the issues is that this becomes a very opinionated kind of statement, where there are a lot of people in the open-source community who feel that if you’re going to use something and make changes to better suit your needs, you should submit those changes back to the community, or back to whoever owns the base of the software. But that said, it’s like the community edition of MySQL before Oracle bought it, where the assumption was that there’s essentially a version of it that anyone can use without the expectation of submitting changes back.


Jesse: So, that’s a broad definition of open-source software, but how does open-source software, broadly speaking, differ from an open-source company? I’m thinking specifically there is the open-source software of Elasticsearch, for example, or I should say, previously the open-source software of Elasticsearch that was owned by the open-source company, Elastic. So, what does that relationship look like? How does an open-source company like that differ from the open-source software itself?


Tim: So, there are typically a couple of ways. Usually, a company that is the owner of an open-source product still retains some kind of IP rights under its license, and—in the words of one of the founders of Elastic—they act as benevolent dictators over the software. And so they allow folks to contribute, but they don’t have to. And most of those open-source software companies will have a commercial version of that software with features that aren’t otherwise available, packaged with support or some other value-added thing that you’re going to wind up paying for. The best way to describe it is, like you said, there’s the company Elastic and then the product Elasticsearch.


I relate it back to before: there was Red Hat Linux, which was open-source, and then the company Red Hat. And I remember when they went public and everyone was shocked that a company could make a profit off of something they gave away for free. But while the core of the software itself was free, the support was not, nor were the add-on features that enterprises wanted. And so that tends to be what the business model is: you create the software, it’s open-source for a while to build a big user base, and then when it gets adopted by enterprises or people who would really pay for support or other features, that’s when the license tends to change, or there’s a fork between the open-source version and the commercial version.


Jesse: And it definitely sounds like there can be benefits to an open-source company charging for not just the open-source software but these extra benefits, like support and additional features, because I know I’ve traced multiple code bugs back to a piece of open-source software where there’s a PR or an issue that has been sitting open for months, if not longer, because the community just doesn’t have the time to look into it or work on it—they’re maintaining it as a side project, separate from their day-to-day work. Whereas if that’s a bug I’m tracing back to software that I’m paying for through an open-source company, I have a much clearer support path to resolving that issue.


Tim: And I think what they end up doing is then you see it more like a traditional commercial software model—a la Oracle, or something like that—where you pay for the software, essentially, but it comes packaged with these things that you get because of it, and then there’s a support contract on top of it, and then there’s hosting or cloud, whatever it is, on top of that. But you would still end up paying for the software and then support as part of the same deal. And as you know, these are for-profit companies. People get paid by them; they are publicly traded; they sell this software—whether it’s the services or the hosting—for profit. That is not open-source software. So, if company X that makes software X goes under, people act like the software goes under with it, as if the software doesn’t belong to the community.


So, a business going after a business is always going to be fair play; I believe they call it capitalism. But when you talk about going after open-source software, you’re looking at what Microsoft was doing in the ’90s and early 2000s with Linux and the other open-source challengers to Windows and the rest of the paid commercial enterprise software market. When folks started using Linux on servers because it was free and customizable, and they could do pretty much everything they had been using commercial Unices for—or even replace Windows—you didn’t really see the commercial Unices going after it, because those were very specialized use cases on specialized hardware. What folks were doing was buying Wintel machines and putting Linux on them—getting them without Windows licenses, or with trial licenses, and throwing Linux on. And Microsoft really went after open-source; they really went after open-source.


They were calling it insecure, they were calling it a flash in the pan, saying it would never last. They ran a good marketing campaign for a long time against open-source software so that people would not use it and would instead use their closed-source software. That is going after open-source, not going after quote-unquote, “open-source companies.”


Jesse: Yeah, I think that’s ultimately what I want to dive into next, which is that there’s been a lot of buzz about AWS going after open-source—being a risk to open-source software, specifically—with the release of AWS managed services for software like Elasticsearch, Kubernetes, Prometheus, and other open-source packages that you can now run as a managed service in AWS. There’s a lot of concern that AWS is basically a risk to all of these pieces of open-source software, but that doesn’t necessarily seem to be the case, based on what we’re talking about. One of the things I want to dive into really specifically here is this licensing idea. Is it important to end-users? How would they know about what license they’re using, or if the license changes?


Tim: I’ll let Amy dig in on it because she’s probably the expert of the three of us, but I will say one case in point where licensing did become very important was Java. JDK licenses—when Oracle started cornering the market and closing down the licensing, you had to use different distributions of Java. So, you had to get, like, OpenJDK; you couldn’t use Sun/Oracle Java, or whatever it was. And so that became a heavy lift of replacing packages and making sure all that stuff was in compliance—tracking packages, replacing them, doing all the necessary things—because if you’re running Java, you’re probably running it in production. Why you would, I don’t know, but there are those things you have to do just to be able to replace a package. The license has an impact, even if it doesn’t cost a dime in usage fees; it still matters, in real dollars and real engineering time.


Amy: Even free licensing will cost you money if you do it wrong. The reason why I love talking about licensing is because I used to work for the government—


Jesse: [laugh].


Amy: —and if you think a large company like Amazon or Microsoft loves doing anything to rattle the cage of smaller businesses, it’s not nearly as much as they love doing it to the government. So, if a company has a government-specific license and the government is not using it correctly, the government will get sued and fined for a bunch of money. That sounds like a conflict between a super-large company and the government—and who the hell cares about that—but it also translates to the way they handle licensing for end-users and for smaller companies. So, for the most part, as an end-user, you’re going to look at what is sent to you before using any piece of software—the EULA, the End-User License Agreement—and you’re just going to say, “Yeah, fine, this thing is 20 pages long; I’m not going to read this; it’s fine.” And for most end-users, that is actually fine, because they’re not going to come after small, single-person users. What these licenses do is restrict the way larger organizations—be it the government or mid-size to larger companies—actually use the software, so that—and this is dating me a little—someone does not buy a single disk that doesn’t report home and then install that one disk on 20 computers, which is a thing everyone has seen done if they’ve been in the industry long enough.


Jesse: Yeah.


Amy: Yeah. And it means things like licensing inventory are important. For the single user, if you’re using a license at home and you install Adobe on three computers, you would think it would not hurt their value very much, but they also make it so that you can’t even do that anymore. So, with purchased software, it makes a big deal for end-users; if it’s just something free, like being able to use some community SQL workbench just to mess around with stuff at home or on personal projects, you’re usually going to be okay.


Corey: This episode is sponsored in part by our friends at ChaosSearch. You could run Elasticsearch or Elastic Cloud—or OpenSearch as they’re calling it now—or a self-hosted ELK stack. But why? ChaosSearch gives you the same API you’ve come to know and tolerate, along with unlimited data retention and no data movement. Just throw your data into S3 and proceed from there as you would expect. This is great for IT operations folks, for app performance monitoring, cybersecurity. If you’re using Elasticsearch, consider not running Elasticsearch. They’re also available now in the AWS marketplace if you’d prefer not to go direct and have half of whatever you pay them count towards your EDP commitment. Discover what companies like HubSpot, Klarna, Equifax, Armor Security, and Blackboard already have. To learn more, visit chaossearch.io and tell them I sent you just so you can see them facepalm, yet again.


Jesse: Yeah, this is a really big issue. There’s so much complexity in this space because Tim, like you said, there’s some amount of capitalism here of AWS competing with open-source companies; there’s business opportunities to change licensing, which can be a good thing for a company or it could be a terrible thing for a company’s user base. There’s lots of complexity to this issue. And I mean, in the amount of time that we’ve been talking, we’ve only really scratched the surface. I think there’s so much more to this space to talk about.


Tim: There really is, and there’s a lot of history that we really need to cover to paint an accurate picture. I think back when web hosting first became a thing, and everyone was running LAMP stacks, nobody was saying, “Oh, no, using cPanel is going to kill Apache.” That wasn’t a thing because, yeah, it was a for-profit company that was using open-source software to make money, and yet Apache still lived, and [unintelligible 00:15:00] still lived; MySQL still made it; PHP was still around. So, to say that utilizing open-source software to provide a service, to provide a paid service, is going to kill the open-source software, at best it’s a misrepresentation that omits a lot of things. So, yeah, there’s a lot of stuff we can dig into, a lot of things we can cover.


And the topic is broad, and so this is why it’s important for us to talk about it, I think, in the context of AWS and the AWS ecosystem: when you see companies crying big crocodile tears, saying, “Oh, yeah, AWS is trying to kill open-source,” it’s like, “No, they’re not trying to kill open-source.” They may be trying to go after your company, but those aren’t the same thing.


Jesse: And it feels to me like that is part of the way that the business world works. And I’m not saying that it’s a great part of the way the business world works, but how can you differentiate your company in such a way that you still retain your user base if AWS releases a competing product? I’m not thrilled with the fact that AWS is releasing all these products that are competing with open-source companies, but I’m also not going to say that it’s not beneficial, in some ways, for AWS customers. So, I see both sides of the coin here and I don’t have a clear idea of what the best path forward is.


Amy: As much as I hate the market demands it type of argument, a lot of the libraries, and open-source software, and all of these other things that AWS has successfully gone after, they’ve gone after ones that weren’t entirely easy to use in the first place. Things like Kubernetes, and Prometheus, and MongoDB, and Elastic. These are not simple solutions to begin with, so if they didn’t do it, there are a lot of other management companies that will help you deal with these very specific products. The only difference is, one of them is AWS.


Jesse: [laugh]. One of them is a multibillion-dollar company.


Amy: Oh, they’ve all got money, man.


Jesse: [laugh].


Amy: I mean, let’s be real. At our pay grade, the difference between a multimillion-dollar and a billion-dollar company, I don’t think affects you at your level at all.


Jesse: No.


Amy: I’m not seeing any of that difference. I am not. [laugh].


Tim: Yeah, I definitely think if you all want us to dig into more of this—and we could do a lot more—let us know. If there are things you think we’re wrong on, or things that you think we need to dig deeper on, yeah, we’d love to do that. Because this is a complex and nuanced topic that does have a lot of information that should be discussed so that folks can have a clear view of what the picture looks like.


Jesse: Well, that’ll do it for us this week, folks. If you’ve got questions you’d like us to answer please go to lastweekinaws.com/QA, fill out the form and we’ll answer those questions on a future episode of the show.


If you’ve enjoyed this podcast, please go to lastweekinaws.com/review and give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review, give it a five-star rating on your podcast platform of choice and tell us your thoughts on this conversation, on AWS versus open-source software versus open-source companies.


Announcer: This has been a HumblePod production. Stay humble.

Fri, 23 Jul 2021 03:00:00 -0700
The Great Lie

Want to give your ears a break and read this as an article? You’re looking for this link: https://www.lastweekinaws.com/blog/the-great-lie/



Never miss an episode



Help the show



What's Corey up to?

Wed, 21 Jul 2021 03:00:00 -0700
The Festival of Quinns
AWS Morning Brief for the week of July 19, 2021 with Corey Quinn.
Mon, 19 Jul 2021 03:00:00 -0700
AWS Application Cost Profiler

Transcript

Corey: This episode is sponsored in part by LaunchDarkly. Take a look at what it takes to get your code into production. I’m going to just guess that it’s awful because it’s always awful. No one loves their deployment process. What if launching new features didn’t require you to do a full-on code and possibly infrastructure deploy? What if you could test on a small subset of users and then roll it back immediately if results aren’t what you expect? LaunchDarkly does exactly this. To learn more, visit launchdarkly.com and tell them Corey sent you, and watch for the wince.



Jesse: Hello, and welcome to AWS Morning Brief: Fridays From the Field. I’m Jesse DeRose.

Amy: I’m Amy Negrette.

Tim: And I’m Tim Banks.



Jesse: This is the podcast within a podcast where we talk about all the ways we’ve seen AWS used and abused in the wild with a healthy dose of complaining about AWS for good measure. Today, we’re going to be talking about a recent addition to the AWS family: AWS Application Cost Profiler.

Tim: But hold on for a second, Jesse, because AWS Application Cost Profiler we can get to; that’s rather unremarkable. I really want to talk about how impressed I am with AWS InfiniDash. I’ve been benchmarking this thing, and it is fan… tastic. It’s so good. And we could probably talk about it for a while, but suffice to say that I am far more impressed with AWS InfiniDash than I am with AWS Application Cost Profiler.



Jesse: You know, that’s fair. And I feel like InfiniDash should absolutely get credit where credit is due. I want to make sure that everybody can really understand the full breadth of everything that InfiniDash is able to accomplish. So, I want to make sure that we do get to that; maybe in a future episode, we can touch on that one. But for right now, I have lots of feelings about AWS Application Cost Profiler, and what better place to share those feelings than with two of my favorite people, Amy and Tim, and then all of you listeners who are listening in to this podcast. I can’t wait to dive into this. But I think we should probably start with, what is AWS Application Cost Profiler?

Amy: It is [unintelligible 00:01:54] in a trench coat.

Jesse: [laugh].



Amy: Which is the way AWS likes to solve problems sometimes. And in this case, it’s talking about separating billing costs by tenants by service, which is certainly a lot of things that people have problems with.

Jesse: That is a lot of buzzwords.

Amy: A lot of words there.

Jesse: Yeah. Looking at the documentation, the sales page, “AWS Application Cost Profiler is a managed service that helps us separate your AWS billing and costs by the tenants of your service.” That has a lot of buzzwords.



Tim: Well, to be fair, that’s also a majority of the documentation about the service.



Jesse: Yeah, that is fair. That is a lot of what we saw, and I think we’ll dive into that with the documentation in a minute. But before we dive into our thoughts on this service—because we did kick the tires on this service and we want to share what our experience was like—I do want to call out the problem that AWS Application Cost Profiler is trying to solve. This idea of cost allocation of shared resources is a real, valid problem, and it is one that is difficult to solve.

Amy: And we’ve had clients that have had this very explicit problem and our findings have been that it’s very difficult to accurately splice usage and spend against what’s essentially consumption-based metrics—which is how much a user or request is using all the way along your pipeline—if they’re not using dedicated resources.

Jesse: Yeah, when we talk about cost allocation, generally speaking, we talk about cost allocation from the perspective of tagging resources, broadly speaking, and moving resources into linked accounts and separating spend by linked accounts, or allocating spend by linked accounts. But if you’ve got a shared compute cluster, a shared database, any kind of shared resources where multiple tenants are using that infrastructure, slapping one tag on it isn’t going to solve the issue. Even putting all of those shared resources in a single linked account isn’t going to solve that issue. So, the problem of cost allocation for shared resource is real; it is a valid problem. So, let’s talk specifically about AWS Application Cost Profiler as a solution for this problem. As I mentioned, we kicked the tires on this solution earlier this week and we have some thoughts to share.

Tim: I think one of the main things around AWS Application Cost Profiler, like I said, is that there are some problems that can be solved there, there are some insights that people really want to gain here, but the problem is people don’t want to do a lot more work or rewrite their observability stack to do it. And that’s exactly what AWS Application Cost Profiler seems to be doing, or seems to want you to do. I think it only gets data from certain EC2 services, and it’s doing things that you can already do in other tools to do aggregation. And if I’m going to do all the work to rewrite that stack to be able to use the Profiler, am I going to want to spend that time doing something else instead? That kind of comes to the bottom line about it.

Jesse: Yeah, the biggest thing that I ran into, or that I experienced when we were setting up the Cost Profiler, is that the documentation basically said, “Okay, configure Cost Profiler and then submit your data.” And [unintelligible 00:05:54] stop, like wait, what? Wait, what do you mean, ‘submit data?’ And it said, “Okay, well now that you’ve got Cost Profiler as a service running, you need to upload all of the data that Cost Profiler is going to profile for you.” It boggles my mind.

Tim: And it has to be in this format, and it has to have these specific fields. And so if you’re not already emitting data in that format with those fields, now you have to go back and do that. And it’s not really solving any problems, but it offers to create more problems.

Amy: And also, if you’re going to have to go through the work of instrumenting and managing all that data anyway, you could send it anywhere you wanted to. You could send it to your own database, to your own visualization. You don’t need Profiler after that.



Jesse: Yeah, I think that’s a really good point, Amy. AWS Cost Profiler assumes that you already have this data somewhere. And if not, it explicitly says—in its documentation it says, to generate reports you need to submit tenant usage data of your software applications that use shared AWS resources. So, it explicitly expects you to already have this data. And if you are going to be looking for a solution that is going to help you allocate the cost of shared resources and you already have this data somewhere else, there are better solutions out there than AWS Application Cost Profiler. As Amy said, you can send that data anywhere. AWS Application Cost Profiler probably isn’t going to be the first place that you think of because it probably doesn’t have as many features as other solutions.
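
For a concrete sense of what that upload step involves, here is a minimal sketch in Python. The bucket name, the column names, and the Application Cost Profiler client name and operation are assumptions based on recollection of the service’s documentation, not a verified integration, so treat this as illustrative only:

    import csv
    import io

    import boto3  # assumes AWS credentials are already configured

    BUCKET = "example-usage-bucket"  # hypothetical bucket name
    KEY = "tenant-usage/2021-07-16.csv"

    # Hypothetical per-tenant usage records your application has to emit
    # itself; the exact required columns come from the service docs.
    records = [
        {"ApplicationId": "my-app", "TenantId": "tenant-a",
         "UsageAccountId": "111111111111", "StartTime": "1626393600",
         "Duration": "3600", "ResourceId": "i-0abc123def456"},
    ]

    # Serialize to CSV and push to S3, where the service reads it from.
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    boto3.client("s3").put_object(Bucket=BUCKET, Key=KEY, Body=buf.getvalue())

    # Tell Application Cost Profiler to ingest the file; the operation name
    # and parameter shape are assumptions worth checking in the boto3 docs.
    acp = boto3.client("applicationcostprofiler")
    acp.import_application_usage(SourceS3Location={"Bucket": BUCKET, "Key": KEY})

The point, as the panel notes, is that every one of those rows is data you have to produce yourself before the service does anything for you.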

Amy: If you were going to instrument things to that level, and let’s say you were using third-party services, you could normalize your own data and build out your own solution, or you can send it to a better data and analytics service. There are more mature solutions out there that require you to do less work.

Corey: This episode is sponsored in part by ChaosSearch. You could run Elasticsearch or Elastic Cloud—or OpenSearch, as they’re calling it now—or a self-hosted ELK stack. But why? ChaosSearch gives you the same API you’ve come to know and tolerate, along with unlimited data retention and no data movement. Just throw your data into S3 and proceed from there as you would expect. This is great for IT operations folks, for app performance monitoring, cybersecurity. If you’re using Elasticsearch, consider not running Elasticsearch. They’re also available now in the AWS marketplace, if you’d prefer not to go direct and have half of whatever you pay them count toward your EDP commitment. Discover what companies like Klarna, Equifax, Armor Security, and Blackboard already have. To learn more, visit chaossearch.io and tell them I sent you just so you can see them facepalm yet again.

Jesse: I feel like I’d missed something, broadly speaking. I get that this is a preview, I get that this is a step on the road for this solution, and I’m hoping that ultimately AWS Application Cost Profiler can automatically pull data from resources. And also, not just from EC2 compute resources, but from other shared services as well. I would love this service to be able to automatically dynamically pull this data from multiple AWS services that I already use. But this just feels like a very minimal first step to me.

Tim: And let’s be honest; AWS has a history of putting out services before they’re ready for primetime, even if they’re GA—

Jesse: Yeah.

Tim: —but this seems so un-useful that I’m not sure how it made it past the six-pager or the press release. It’s disappointing for a GA service from AWS.

Amy: What would you both like to see, other than it just being… more natively picked up by other services?

Tim: I would like to see either a UI for creating the data tables that you’re going to need, or a plugin that you can automatically put with those EC2 resources: an agent you can run, or a sidecar, or a collector that you just enable to gather that data automatically. Because right now, it’s not really useful at all. What it’s doing is basically the same thing you can do in an Excel spreadsheet. And that’s being very, very honest.

Jesse: Yeah, I think that’s a really good point that ultimately, a lot of this data is not streamlined and that’s ultimately the thing that is the most frustrating for me right now. It is asking a lot of the customer in terms of engineering time, in terms of design work, in terms of implementation details, and I would love AWS to iterate on this service by providing that dynamically, making it easier to onboard and use this service.

Amy: Personally, what I would like is some use case, or demonstration, or tutorial that shows how to track consumption costs using non-compute resources, like Kinesis especially, because you’re shoving a lot of things in there, and you just need to be able to track these things and have that show up in some sort of visualization like Cost Explorer. Or even have that wired directly to Cost Explorer so that you can, from Cost Explorer, drill down to a request and be able to see what it is actually doing, and what it’s actually costing. I want a lot of things.

Jesse: [laugh]. But honestly, I think that’s why we’re here, you know? I want to make these services better. I want people to use the services. I want people to be able to allocate costs of shared resources. But it is still a hard problem to solve, and no one solution has quite solved it cleanly and easily yet.

You know what? Amy, to get back to your question, that’s ultimately what I would love to see, not just specifically with an AWS Application Cost Profiler necessarily, but I would love to see better native tools in AWS to help break out the cost of shared resources, to help break out and measure how tenants are using shared resources in AWS, natively. More so than this solution.

Amy: I would love that. It would make so many things so much easier.

Jesse: Mm-hm. I’m definitely going to be adding that to my AWS wishlist for a future episode.

Tim: How many terabytes is your AWS wishlist right now?

Jesse: Oh… it is long. I, unfortunately, have made so many additions to my AWS wishlist that are qualitative things—more so than quantitative things—that just aren’t going to happen.

Amy: You become that kid at Christmas that gets onto Santa’s lap in the mall with a wish list on a roll of paper that just hops off the platform and goes down the hall, and all the other kids are staring at you and ready to punch you in the face when you get off. [laugh].

Jesse: [laugh]. All right, well that’ll do it for us this week, folks. If you’ve got questions you’d like us to answer please go to lastweekinaws.com/QA, fill out the form and we’d be happy to answer that question on a future episode. If you’ve enjoyed this podcast, please go to lastweekinaws.com/review and give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review, give it a five-star rating on your podcast platform of choice and tell us how you allocate the costs of shared resources.

Announcer: This has been a HumblePod production. Stay humble.

Fri, 16 Jul 2021 03:00:00 -0700
Corey Writes Open-Source Code for Lambda and Tailscale

Want to give your ears a break and read this as an article? You’re looking for this link.

https://www.lastweekinaws.com/blog/Corey-Writes-Open-Source-Code-for-Lambda-and-Tailscale



Never miss an episode



Help the show



What's Corey up to?

Wed, 14 Jul 2021 03:00:00 -0700
The Transitive Property of Cloud Bills
AWS Morning Brief for the week of July 12, 2021 with Corey Quinn.
Mon, 12 Jul 2021 03:00:00 -0700
AWS Account Teams and You

Links


Transcript
Corey: If your mean time to WTF for a security alert is more than a minute, it's time to look at Lacework. Lacework will help you get your security act together for everything from compliance service configurations to container app relationships, all without the need for PhDs in AWS to write the rules. If you're building a secure business on AWS with compliance requirements, you don't really have time to choose between antivirus or firewall companies to help you secure your stack. That's why Lacework is built from the ground up for the Cloud: low effort, high visibility and detection. To learn more, visit lacework.com.

Jesse: Hello, and welcome to AWS Morning Brief: Fridays From the Field. I’m Jesse DeRose.

Amy: I’m Amy Negrette.


Tim: And I’m Tim Banks.

Jesse: This is the podcast within a podcast where we talk about all the ways we’ve seen AWS used and abused in the wild, with a healthy dose of complaining about AWS for good measure. Today, we’re going to be talking about, really, a couple things; building your relationship with AWS, really. This stems from one of the questions that we got from a listener from a previous event. The question is, “How do the different companies that we’ve worked with work with AWS? Is the primary point of contact for AWS at a company usually the CTO, the VP of engineering, an architect, an ops person, a program manager, or somebody from finance, a [unintelligible 00:01:00] trainer? Who ultimately owns that relationship with AWS?”

And so we’re going to talk about that today. I think there’s a lot of really great content in this space. Pete and I, back in the day, recorded an episode talking about building your relationship with your account manager, and with your TAM, and with AWS in general. I’ll link that in the show notes. That’s a great precursor to this conversation. But I think there’s a lot of great opportunities to build your relationship and build rapport with AWS, as you work with AWS and as you put more things on the platform.

Amy: I think one of the things we always say right off the bat is that you should introduce yourself and make a good relationship with your account manager and your technical account manager, just because they’re the ones who, if you need help, they’re going to be the ones to help you.

Jesse: Yeah, I think one of the things that we should also take a step back and add is that if you are listening to this and you’re saying to yourself, “I don’t have an account manager,” that’s actually wrong; you do have an account manager. Anybody who’s running workloads on AWS has an account manager. Your account manager might not have reached out to you yet because usually speaking, account managers don’t reach out unless they see that you’re spending a certain amount of money. They usually don’t start a conversation with you unless you specifically are spending a certain amount of money, have reached a certain threshold, and then they want to start talking to you about opportunities to continue using AWS, opportunities to save money, invest in AWS. But you definitely have an account manager and you should definitely start building that rapport with them as soon as possible.

Amy: First question. How do you actually engage your account manager?

Tim: So, there’s a couple ways to do it. If you have reached a certain spend threshold where your account manager will reach out to you, it’s real simple: you just reply back to them. And it kind of depends. The question most people are going to have is, “Well, why do I need to reach out to my account manager? If I just have, like, a demo account, if I’m just using free tier stuff.”

You probably don’t ever need to reach out to your account manager, so what are the things, typical things that people need to reach out to their account manager for? Well, typically because they want to grow and want to see what kind of discounts are offered for growth, and I want to see what I can do. Now, you can open a support ticket, you can open a billing ticket, but what will end up happening is once you reach a spend threshold, your account manager will reach out to you because they want to talk to you about what programs they have, they want to see how they can help you grow your account, they want to see what things they can do for you because for them, that means you’re going to spend more money. Most account managers within a little bit of time of you opening your account and reaching a lower spend threshold, they’re going to send you an email and say, “Hey, this is my name, this is how you reach me,” et cetera, et cetera. And they’ll send you some emails with links to webinars or other events and things like that, and you can typically reply back to those and you’ll be able to get your account manager sometimes as well. But like I said, the easiest way to get a hold of your account manager or find out who it is, is to start increasing your spend on AWS.

Jesse: So, then if you’re a small company, maybe a startup, or maybe just a student using AWS for the first time, likely that point of contact within a company is going to be you. From a startup perspective, maybe you are the lead engineer, maybe you are the VP of engineering, maybe you are the sole engineer in the company. We have seen most organizations that we talk to build and own that relationship with AWS at an engineering management or senior leadership level. Engineering management seems to be the sweet spot because usually, senior leadership has a larger view of things on their plate than just AWS, so they’re focused on larger business moves for the company, but the engineering manager normally has enough context and knowledge of all of the day-to-day specifics of how engineering teams are using AWS to really be involved in that conversation with your account manager, with your technical account manager, or with your solutions architect, or whatever set of folks you have from AWS’s side for an account team. And I think that’s another thing that we should point out as well, which is, you will always have an account manager; you won’t always have a technical account manager.

The technical account manager generally comes in once you have signed an enterprise discount program agreement. So, generally speaking, that is one of the perks that comes with an EDP, but obviously, there are other components to the EDP to be mindful of as well.

Tim: So, let me clarify that. You get a technical account manager when you sign up for enterprise support. You don’t have to have an EDP to have enterprise support, but when you sign up for enterprise support, you automatically get a technical account manager.

Jesse: And, Tim, if you could share with everybody, what kind of things can you expect from a technical account manager?

Tim: So, a technical account manager, I mean, they will do—like, all TAMs everywhere pretty much can liaise with support to escalate tickets or investigate them and see what’s going on with them, try and, kind of, white-glove them into where they need to be. AWS TAMs also have the same—or a lot of the same—access to the backend. Not your data, because no one at AWS actually has access to your data or inside your systems, but they have access to the backend, so they can see API calls, they can see logs, and they can see other things like that to get insight into what’s going on in your system and so they can do analytics. They have insight into your billing, they can see your Cost Explorer, they can see what your contract spends are, they can see all the line items in your bills, they have access to the roadmaps, they have access to the services and the service teams so that if you need to talk to someone at a particular service team, they can arrange that meeting for you. If you need to talk to specialist SAs, they can arrange those meetings for you.

With a TAM—and if you have enterprise support and they’re looking at you for an EDP—you can have what’s called an EBC, or Executive Briefing Center, where, in non-pandemic times, they will bring you to Seattle, put you up for a couple of days, and you’ll have a couple of days of meetings with service teams to go over, kind of, what the roadmap looks like and what your strategy for working with those teams or those services is. And you can get good steps on how to utilize those services, whether it’s going to be some more deep dives on-site, or whether it’s going to be some key roadmap items that the service team is going to prioritize, and other things like that. And the EBC is actually pretty neat, but you know, you have to be a larger spender to get access to those. Another thing that a TAM can do is they can actually enter items on the roadmap for you. They have access to, and can provide you access to, betas, or pilot programs, or private releases for various services.

You’ll have access to a weekly email that includes what launches or releases are pending over the next week or two. You’ll have access to quarterly or monthly business reviews where you get to see what your spend looks like, what your spending trends are, support ticket trends, you know, usage and analytics, and things like that. So, a TAM can be quite useful. They can do quite a lot for you, especially in the realm of cloud economics. That said, every TAM has their specialty.

I mean, depending on how many customers they have, the level of engagement you get may vary. And, you know, some TAMs are super, super good at the financial aspects, some are better at the technical aspects. So, to be fair, because the TAM org is so large at AWS, you don’t always have the same experience with all your TAMs, and the level of depth to which they can dive is going to vary somewhat.

Corey: This episode is sponsored in part by ChaosSearch. You could run Elasticsearch or Elastic Cloud—or OpenSearch, as they’re calling it now—or a self-hosted ELK stack. But why? ChaosSearch gives you the same API you’ve come to know and tolerate, along with unlimited data retention and no data movement. Just throw your data into S3 and proceed from there as you would expect. This is great for IT operations folks, for app performance monitoring, cybersecurity. If you’re using Elasticsearch, consider not running Elasticsearch. They’re also available now in the AWS marketplace, if you’d prefer not to go direct and have half of whatever you pay them count toward your EDP commitment. Discover what companies like Klarna, Equifax, Armor Security, and Blackboard already have. To learn more, visit chaossearch.io and tell them I sent you just so you can see them facepalm yet again.

Amy: So, let’s say we got the best TAM—even though he technically works for us now—when trying to envision what our relationship with the world’s best TAM is going to be—and I just imagine that as a nice little block text on a white mug—what is that relationship going to look like? How are we going to engage with them? And even, how often should we talk to them?

Jesse: I used to work for an organization that had, I believe, quarterly meetings with our account manager and our TAM, and every time we met with them, it felt like this high stakes poker game where we didn’t want to show our cards and they didn’t want to show their cards, but then nobody really was able to do anything productive together. And I have to say that is the exact opposite of how to engage your account manager and your TAM.

Tim: Yeah, that doesn’t sound great.

Jesse: No, it was not great. I do not recommend that. You want to have an open, honest conversation about your roadmap, about what you want to do with AWS.

Amy: They’re not getting that mug.

Tim: No, no.

Jesse: [laugh].

Tim: So, if you have a super-engaged TAM—and I will use my own experience as a TAM at AWS—we had office hours, routinely, bi-weekly. One customer I had, I would have on-site office hours at their offices in LA, and I would have virtual office hours for their offices in London. And those office hours, we would use to bring in specialist SAs, or to go over roadmap items, or tickets, or something like that, or to do architectural reviews, or cost reviews. We would schedule quarterly business reviews aside from that, sometimes the same day or on the same group of days, but those would typically be different than office hours. I was in their Slack channel, so if they needed to ping me on something that’s not a ticket but a question, we could have conversations in there. A couple of their higher points of contact there had my phone number, so they would call me if something was going on. They would page me—because AWS TAMs have pagers—if they had a major issue, or, like, an outage or something [unintelligible 00:11:05] that would affect them.

Jesse: I’m sorry, I just have to ask really quick. Are we talking, like, old school level pager?

Tim: No, no, no. Like on your phone, like PagerDuty.

Jesse: Okay, okay. I was really excited for a minute there because I kind of miss those old-school pagers.

Tim: Let me say, it was like PagerDuty; it wasn’t actual PagerDuty because AWS did not actually use PagerDuty. They had something internal, but PagerDuty was the closest analog.

Amy: Internal PagerDuty as a Service.

Tim: Something like that.

Jesse: Oh, no.

Amy: So, you know, if you have a very engaged TAM, you would have regular contact, several times a week if not daily, right? Additionally, the account team will also meet internally to go over strategy, issues, action items, and things like that once or twice a week. Some accounts have multiple TAMs, in which case the touchpoints are even more often.



Jesse: I feel like there’s so much opportunity for engagement with your AWS account team, your account manager, your TAM. It’s not entirely up to you to build that relationship, but it is a relationship; it definitely requires investment and energy from both sides.

Tim: And I would say, in the context of who’s working with a TAM, ideally, the more points of contact you have at an org with your TAM, the better off it’s going to be. So, you don’t want your TAM or account team to only talk to the VP of engineering, or the DevOps manager, or the lead architect; you want them to be able to talk to your devs, and your junior devs, and your finance people, and your CTO, and other folks like that, and pretty much anyone who’s a stakeholder, because they can have various conversations, and they can bring concerns around. If they’re talking with junior devs, your TAM can actually show them how to use CloudFormation and the AWS CLI, or do a workshop on the basics of using Kubernetes, or something like that. Whereas if they’re going to have a conversation with the VP of engineering, they’re going to talk about strategies, they’re going to talk about roadmap items, they’re going to talk about how things can affect the company, they’re going to talk about EDPs and things like that. So ideally, in a successful relationship with your TAM, several people in your org are going to have that TAM’s contact information and will talk with them regularly.

Jesse: One of the clients that we worked with actually brought us in for a number of conversations, and brought their TAM in as part of those conversations, too. And I have to say, having the TAM involved in those conversations was fantastic because as much as I love the deep, insightful work that we do, there were certain things about AWS’s roadmap that we just don’t have visibility into sometimes. And the TAM had that visibility and was able to be part of those conversations on multiple different levels. The TAM was able to communicate to multiple audiences about both roadmap items from a product perspective, from a finance perspective, from an engineering architecture perspective; it was really great to have them involved in the conversation and share insights that were beneficial for multiple parties in that meeting.

Tim: And oftentimes, too, involving your TAM when you do have this one thing in your bill you can’t figure out, saying, “We’ve looked and this spend is here, but we don’t know exactly why it is.” Your TAM can go back and look at the logs, or go back and look at some of the things that were spun up at the specific time and say, “Oh, here was the problem. It was when you deployed this new AMI, it caused your CPU hours to go way, way up, so you had to spin up more instances.” Or a great one was a few years back when Datadog changed its API calls and a lot of people’s CloudWatch costs went through the roof. And then several TAMs had to go through and figure out it was this specific call, and this is how you fix that, and give that guidance back to their customers to reduce their spend. So, being able to have that backend access is very, very useful, even when you are working with an optimization group like ourselves or other folks, to say, “Hey, we’ve noticed these things. These are the line items we want to get some insight into.” I mean, your TAM can definitely be a good partner in that.

Jesse: All right, folks, well, that’ll do it for us this week. If you’ve got questions that you’d like us to answer, please go to lastweekinaws.com/QA. Fill out the form; we’d be happy to answer those on a future show. If you’ve enjoyed this podcast, please go to lastweekinaws.com/review and give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review. Give it a five-star rating on your podcast platform of choice and tell us, did Tim pronounce the shortening of ‘Amazon Machine Image’ correctly as ‘ah-mi’ or should he have said ‘A-M-I?’

Amy: I heard it and I wasn’t going to say it. [laugh].

Jesse: [laugh].

Amy: I was just going to wait for someone to send him the t-shirt.

Tim: Just to note, if you put beans in your chili, you can keep your comments to yourself.

Jesse: [laugh].

Amy: You’re just going to keep fighting about everything today, is all I’m—[laugh].

Jesse: [laugh]. Oh, no.

Announcer: This has been a HumblePod production. Stay humble.

Fri, 09 Jul 2021 03:00:00 -0700
The Lessons of AWS Infinidash

Want to give your ears a break and read this as an article? You’re looking for this link. https://www.lastweekinaws.com/blog/the-lessons-aws-infinidash



Never miss an episode



Help the show



What's Corey up to?

Wed, 07 Jul 2021 03:00:00 -0700
Andy Jassy Infinidashes Upstairs
AWS Morning Brief for the week of July 5, 2021 with Corey Quinn.
Mon, 05 Jul 2021 03:00:00 -0700
Tagging Isn’t Just About Cost

Links:


Transcript

Corey: If your mean time to WTF for a security alert is more than a minute, it's time to look at Lacework. Lacework will help you get your security act together for everything from compliance service configurations to container app relationships, all without the need for PhDs in AWS to write the rules. If you're building a secure business on AWS with compliance requirements, you don't really have time to choose between antivirus or firewall companies to help you secure your stack. That's why Lacework is built from the ground up for the Cloud: low effort, high visibility and detection. To learn more, visit lacework.com.

Jesse: Hello, and welcome to the AWS Morning Brief: Fridays From the Field. I’m Jesse DeRose.

Amy: I’m Amy Negrette.

Tim: And I’m Tim Banks.

Jesse: This is the podcast within a podcast where we talk about all the ways we’ve seen AWS used and abused in the wild, with a healthy dose of complaining about AWS for good measure. Today, we’re actually going to talk about a very specific listener question that we didn’t get to last week, but really, we had so many thoughts on this topic that we wanted to break it out into its own episode. So, today we’re going to be talking about tagging, and the importance of tagging, and how tagging can be used. And when I say tagging, specifically we’re talking about user-defined cost allocation tags. The original question that I’ll read off was from [Aaron 00:00:58].

Aaron asks, “Is tagging over-recommended as a cost reporting mechanism? I recently took on managing my company’s AWS bill, and when talking to AWS and reading third-party blog posts about cost management, a solid tagging strategy is often extolled as step zero for understanding AWS costs. Based on what I know about AWS so far, this approach seems like it may work for some aspects of cost management, but does not seem to be a sound strategy for more formal cost reporting, like budgeting or calculating total spend for a given product or cost center. To me, these activities require complete or near-complete accuracy that tags just don’t seem to be able to provide, since there are some costs, like data transfer, that aren’t tagged, and the fact that tags are not retroactive”—that’s a big one that I can say is super frustrating for me—“Is there something I’m missing here? Is there, in fact, a way to use these tags to ensure that 100% of an AWS account’s costs are in fact attributed back to a specific cost center accurately? It seems drastically simpler to embrace a multi-account strategy where each account is simply billed to whatever cost center makes sense to the organization.” So, Amy and Tim, again, the main question here is, is tagging over-recommended as a cost reporting mechanism?

Tim: The simple answer is no, it is not over-recommended. And the question makes a lot of good points around some of the heartaches and some of the problems that come with tagging, specifically about tags not being retroactive; if you’re going to make changes to reflect changes in the past, I mean, you know, I don’t really have a good answer for that, if we’re being honest. But if we’re talking about tracking costs from this point forward, tagging is going to be a much more concise solution than using a multi-account strategy. That said, there are a lot of reasons you should use a multi-account strategy and tagging together. Multi-account strategy and tagging strategy should definitely be an ‘and’ situation, not an ‘or’ situation. That’s like pizza or steak. No. It’s both pizza and steak.

And I feel like that because there are a number of non-cost reasons to use multiple accounts, especially in AWS, the biggest concern of which is service limits, right? Service limits, as you know, are set per account, per region, so if I have a service limit on the number of S3 buckets that I can create—and I think that the hard limit is, like, one thousand—once I need that one-thousand-and-first S3 bucket, I have to create another account. That account can still be production, it can still be for all the same things that I’ve used for anything else, but I had to add another account so I can spin up S3 buckets. So, how do I track what those buckets are for, what those costs are going to be? I’m going to track those with tags.
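
A quick aside on that bucket limit: if you want to see how close an account is to a given quota, the Service Quotas API can tell you. Here is a minimal sketch, with the caveat that the quota code for S3 buckets is an assumption and should be confirmed via list_service_quotas first:

    import boto3

    # Service Quotas is a regional API; bucket quotas apply account-wide.
    sq = boto3.client("service-quotas", region_name="us-east-1")

    # "L-DC2B2D3D" is, to the best of our recollection, the quota code for
    # S3 "Buckets"; confirm with sq.list_service_quotas(ServiceCode="s3").
    quota = sq.get_service_quota(ServiceCode="s3", QuotaCode="L-DC2B2D3D")

    buckets = boto3.client("s3").list_buckets()["Buckets"]
    print(f"{len(buckets)} buckets used of {quota['Quota']['Value']:.0f} allowed")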

And I’m going to track those tags from the payer account, or from up in the organization. So, as you set up multiple accounts, you can have—even if they’re all production, they still need to be tagged. Even if they’re all dev, they still need to be tagged. If you’re using the account vending machine style stuff from Control Tower where you spin up a sandbox account, you run some stuff, and then you throw it away, tagging is going to be the best way to track those costs, not just the fact that this account is named a certain thing. Names are arbitrary; they don’t really reflect necessarily what they’re going to be for, accounts can come and go.

So, I don’t necessarily like the use of names. Plus, sometimes it’s hard to do that if you’re doing, like, [unintelligible 00:04:21] various countries and things like that, various languages. Different things can impart different meanings. Tags can still have language problems too, but they are arbitrary values. You know you’re going to try and lump these all together; that’s all that matters.

So, I definitely think that if we’re using tagging, tagging is going to let you be more concise with your costs, it’s going to let you apply costs across different accounts more readily, and it’s going to let you apply costs across different cloud providers, especially if you use one of the CMP tools like CloudHealth, or Cloudcheckr, or something like that, and you run production workloads from a single cost center across multiple clouds: you’re going to want to tag those in those tools so that you can keep consistent, more concise tracking of costs, versus just using account names. Account names, after a while, are just going to become unmanageable when it comes to tracking costs.

Amy: I totally agree. And one of the big things that I harp on, especially on this podcast, is that if you’re worried that it’s not going to be as explicit as other billing methods, you will still at least have that data. You will still know, per resource—if it’s properly tagged—who it’s supposed to be charged to and who owns it. You make that decision on an architectural level; you should also make it for your bill, just to make sure that if you ever need that information in the future, you can go get it. And since tags don’t happen retroactively, you may as well do it as early as possible.
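
Since tags are not retroactive, the cheapest time to apply them is at resource creation. A minimal sketch of doing that with boto3, where the bucket name and tag keys are placeholders:

    import boto3

    s3 = boto3.client("s3")
    bucket = "example-team-data-bucket"  # hypothetical name

    # Works as-is in us-east-1; other regions also need a
    # CreateBucketConfiguration with a LocationConstraint.
    s3.create_bucket(Bucket=bucket)

    # Apply cost allocation tags immediately, not as a later cleanup project.
    # Note: put_bucket_tagging replaces the bucket's entire tag set.
    s3.put_bucket_tagging(
        Bucket=bucket,
        Tagging={"TagSet": [
            {"Key": "cost-center", "Value": "1234"},
            {"Key": "team", "Value": "platform"},
            {"Key": "environment", "Value": "production"},
        ]},
    )

Remember that user-defined tags also have to be activated as cost allocation tags in the Billing console before they show up in cost reports.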

Jesse: Yeah. It’s super frustrating that a lot of this information is not available retroactively. And while I understand the technical limitations of that, I can’t stress enough why starting to tag resources early is super, super critical to understanding that spend, and to using that tagging setup, that tagging policy, to better understand your spend in a number of different ways. But again, I also want to call out that I’ve been saying everything about tagging related to spend, and there are other ways that tags can be beneficial to your organization. I’ve seen organizations where security needs to know: are all of the containers that we’re running patched to a certain level?

Are all of the AMIs that we’re running patched to a certain level? Tags can do that; tags can help you understand which resources are using a certain AMI version, or a certain container version, or other security pieces that are important for security to know, and to be able to confirm that all of these resources are patched to the latest available version of whatever we’re looking at. One of the things that we talk about a lot in this podcast is having conversations with other teams, because cloud cost management is not just an engineering responsibility. It’s a responsibility of finance, and product, and security, and IT, because there are all sorts of different groups that may ultimately be using the cloud. And it’s important for everybody to be on the same page in terms of how you’re using the cloud. So, it’s not just about tagging so you can know the cost of something, but tagging so that you can know all these other important things, like security, like product details, like maybe IT details: all these other different use cases for different departments that are also involved in cloud usage.
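
For the security use case Jesse describes, the Resource Groups Tagging API can answer “what is still running the old version” across services in one sweep. A sketch, where the tag key and value are hypothetical:

    import boto3

    tagging = boto3.client("resourcegroupstaggingapi")

    # Page through every resource carrying an out-of-date (hypothetical)
    # ami-version tag so security can chase down unpatched instances.
    paginator = tagging.get_paginator("get_resources")
    pages = paginator.paginate(
        TagFilters=[{"Key": "ami-version", "Values": ["2021.05"]}]
    )
    for page in pages:
        for resource in page["ResourceTagMappingList"]:
            print(resource["ResourceARN"])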

Corey: This episode is sponsored in part by ChaosSearch. You could run Elasticsearch or Elastic Cloud—or OpenSearch, as they’re calling it now—or a self-hosted ELK stack. But why? ChaosSearch gives you the same API you’ve come to know and tolerate, along with unlimited data retention and no data movement. Just throw your data into S3 and proceed from there as you would expect. This is great for IT operations folks, for app performance monitoring, cybersecurity. If you’re using Elasticsearch, consider not running Elasticsearch. They’re also available now in the AWS marketplace, if you’d prefer not to go direct and have half of whatever you pay them count toward your EDP commitment. Discover what companies like HubSpot, Klarna, Equifax, Armor Security, and Blackboard already have. To learn more, visit chaossearch.io and tell them I sent you just so you can see them facepalm yet again.

Tim: Yeah. I think there’s this idea that comes, I think, from very legacy data center operations, where you’re going to use an account name to kind of specify what it does and where it comes from, in the same way that you would use, like, a host naming scheme to define what a computer is and what it does and things like that. And I think that can be practical, but it’s often short-sighted, especially as an organization grows, and you create more accounts, and you bubble up other accounts [unintelligible 00:08:21] accounts. It comes time to sign the EDP and you need to have a master payer account, you acquire some other accounts, and things like that, and then all of a sudden, whatever naming scheme they used is now integrated into what your naming scheme is. And that becomes, maybe, unmanageable.

So, I’ve always preferred account names that—I mean, if you need to have it specified, understand it’s going to just be for humans to find it really quick, but I’m just as content to have an account name be a UUID and then have some other kind of method for looking at what it does or assigning billing to it. Because in the end, like I said, I prefer to use tagged resources to define what they are and where they go. There are obviously going to be exceptions made for things that are, like, dev, test, UAT, or something like that, where [unintelligible 00:09:06] are different, but we’re still talking about changes on an account, and then you make the changes on the account as you need. And then if it’s for production, then obviously those accounts can be tagged as production. They don’t have to necessarily be named production.

Amy: Right, and I think, security boundaries and resource permissions aside, if you’re just looking at trying to track costs to a resource, an account ID is really just one piece of information as opposed to tags, where you can just overload it with as much information as you need.

Jesse: Absolutely. Now, one other thing that I do want to talk about is we’re talking about a lot of good use cases for tags. We should also talk about some of the not-so-good use cases for tags, or some of the not-so-great best practices for tags that we have seen. Amy, I know specifically you had some examples that you want to talk about.

Amy: Yes. [laugh]. So, this comes from having to do data normalization, back in the day. The first thing you want to do when developing your tag strategy is determine things like casing, or whether or not you’re allowed to use spaces, because I’ve seen in different places, not just in resource tagging but also in the way information is meta-keyed, key names that look identical to a completely different key name: you have ‘product owner,’ except ‘product owner’ is capitalized in one instance and not capitalized in another instance, and these are considered to be different things within the system. Whether or not that’s your intention, they will show up as different things in some visualizations. In other visualizations, they will get normalized and turned into the same thing. So, it really depends on what it is that you want your reports to look like and what you want these resources to be able to tell you.
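
The normalization pass Amy is describing looks roughly like this: collapse casing and whitespace variants of a key into one canonical form, and surface any collisions so they can be fixed at tagging time. Pure Python, no AWS calls:

    def normalize_tag_key(key):
        """Collapse casing/whitespace variants: 'Product Owner ' -> 'product-owner'."""
        return "-".join(key.strip().lower().split())

    raw_tags = {"Product Owner": "amy", "product owner": "tim", "Env": "prod"}

    normalized = {}
    for key, value in raw_tags.items():
        canonical = normalize_tag_key(key)
        # Variants of the same key collide here; last writer wins, which is
        # exactly the ambiguity you want to surface and then fix upstream.
        if canonical in normalized and normalized[canonical] != value:
            print(f"conflict on {canonical!r}: {normalized[canonical]!r} vs {value!r}")
        normalized[canonical] = value

    print(normalized)  # {'product-owner': 'tim', 'env': 'prod'}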

Jesse: Yeah, that’s a really great point that one of the things that we haven’t potentially touched on for this episode, and is covered in a number of other podcast episodes and blog posts in general is a good tagging strategy is equally as important as tagging coverage. Knowing that the tags should all be uppercase or all lowercase, or use these types of characters and not those types of characters is equally as important as making sure that those tags are applied accurately across all of your resources. So, as you are talking about tagging, as you are thinking about tagging, even in the multi-account situation, it is important to think about, what are the best practices? What are the standards that you want for your tagging? And again, this may not be a conversation that you have in a silo by yourself; this may be a conversation that you have with a number of other teams because there may be a number of other teams that need certain information from tags and need to use certain letters or special characters. And you need to incorporate all of that; you need to include all of that in the tagging policy that you create.

Tim: I think it’s also important, though, that with most analytics tools, even if it’s just, you know, Cost Explorer within the AWS Console, you can still aggregate those tags together; especially if you’re doing costs, you can absolutely aggregate multiple casings and things like that. In CloudHealth, I know you can select multiples, or anything that matches a pattern regardless of case, and do it that way. So, it is possible to work around those mistakes. It’s not a, “Oh, we didn’t have our tagging schema set up correctly, so throw your hands up and give up.” It’s just something else you have to consider, and hopefully, you can normalize going forward.
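
Tim’s after-the-fact aggregation looks roughly like this against the Cost Explorer API. The tag key, date range, and the ‘key$value’ group-key format are based on how the API behaved at the time; note that tag keys are case-sensitive, so merging ‘Team’ and ‘team’ would take one grouped query per key variant:

    import boto3

    ce = boto3.client("ce")  # Cost Explorer

    resp = ce.get_cost_and_usage(
        TimePeriod={"Start": "2021-06-01", "End": "2021-07-01"},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "TAG", "Key": "team"}],  # case-sensitive tag key
    )

    # Fold 'team$Frontend' / 'team$frontend' style groups together by
    # lowercasing the value half of each group key.
    totals = {}
    for result in resp["ResultsByTime"]:
        for group in result["Groups"]:
            _, _, value = group["Keys"][0].partition("$")
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            totals[value.lower()] = totals.get(value.lower(), 0.0) + amount

    print(totals)  # e.g. {'frontend': 1234.56, '': 78.90} ('' = untagged)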

Jesse: Yeah, absolutely.

Amy: And really, the other thing is to make sure that the tags that you choose make sense for what you’re doing. So, if you are tagging the environment and that is the only tag that you put on a resource, then just know that when you start pulling things up in Cost Explorer or the Cost and Usage Report, that’s the only thing you’re going to see. So, you’re only going to see things split up between your production account and your dev account; you’re not going to be able to see what service is actually costing you more money, or what storage, associated with a team, has suddenly decided to grow beyond the usual predicted usage patterns.



Jesse: Yeah, we have some recommendations we can make if you are just getting started on your tagging journey, and I will make sure that information is shared in the [show notes 00:13:53]. But ultimately, again, it becomes a strategy conversation. It becomes a question of what are you trying to accomplish? What are the goals that you’re trying to accomplish? What is the information that you want out of tagging? Because that’s ultimately going to drive what you tag and why you tag.

All right, that’ll do it for us this week, folks. If you’ve got questions you’d like us to answer, please go to lastweekinaws.com/QA, fill out the form and we’d be happy to answer your question on a future episode. If you’ve enjoyed this podcast, please go to lastweekinaws.com/review and give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review. Give it a five-star rating on your podcast platform of choice and tell us what are the most important things that you focus on in your tagging strategies? What are the things that you tag for your company?

Announcer: This has been a HumblePod production. Stay humble.

Fri, 02 Jul 2021 03:00:00 -0700
I Scored 81% on my AWS Certification Exam, Locking in my re:Invent Lounge Pass

Want to give your ears a break and read this as an article? You’re looking for this link: https://www.lastweekinaws.com/blog/I-Scored-81%-on-my-AWS-Certification-Exam,-Locking-in-my-re:Invent-Lounge-Pass



Never miss an episode



Help the show



What's Corey up to?

Wed, 30 Jun 2021 03:00:00 -0700
The Wickr Managed Service
AWS Morning Brief for the week of June 28, 2021 with Corey Quinn.
Mon, 28 Jun 2021 03:00:00 -0700
Should I Attend re:Invent?

Transcript

Corey: This episode is sponsored in part by LaunchDarkly. Take a look at what it takes to get your code into production. I’m going to just guess that it’s awful because it’s always awful. No one loves their deployment process. What if launching new features didn’t require you to do a full-on code and possibly infrastructure deploy? What if you could test on a small subset of users and then roll it back immediately if results aren’t what you expect? LaunchDarkly does exactly this. To learn more, visit launchdarkly.com and tell them Corey sent you, and watch for the wince.


Jesse: Hello, and welcome to AWS Morning Brief: Fridays From the Field. I’m Jesse DeRose.

Amy: I’m Amy Negrette.

Tim: I’m Tim Banks.

Jesse: This is the podcast within a podcast where we talk about all the ways that we’ve seen AWS used and abused in the wild, with a healthy dose of complaining about AWS for good measure. Today on the show, we are going to be talking about AWS re:Invent. Now, I know that most of you know what re:Invent is, but I just would love to set the playing field level for everybody really quick. Amy, Tim, what is AWS re:Invent.

Tim: AWS re:Invent is AWS’s week-long corporate conference. It’s not really a user conference; it’s certainly not, like, a community conference, but it’s a week-long sales pitch in the desert. It’s like the worst version of a corporate Burning Man you could ever imagine because they even have a concert.

Jesse: It is in Las Vegas. Now, I personally have mixed feelings about going to Las Vegas in general, but this adds so much to the conference in general because it’s not just in a single conference venue that’s centrally located near the hotels. It is across the strip—

Amy: It’s the entire strip.

Jesse: It’s the entire strip. So—

Amy: They block every hotel and they buy every piece of ad space.



Jesse: Yes. There is no escaping AWS re:Invent for the entire week that you’re there. And sometimes that’s a good thing because you do want to be involved in what’s going on, but other times, it is a lot.

Tim: So, I’m trying to figure out which LP ‘buy the entire Las Vegas strip’ falls under because it’s certainly not Be Frugal.

Amy: No. [laugh].

Jesse: No, not at all. But we do have new information. We decided to do this episode specifically because new information was just released about re:Invent for this year. Amy, what is that information? What do we know?

Amy: They’ve decided, in having to go virtual last year, due to some kind of horrible global crisis, to return in person to the world’s most densely packed tourist spot, Las Vegas, and host this huge event from November 29th to December 3rd—that’s right after Thanksgiving—and just, what do they say? Return to normal. Return to normal.

Tim: That way everybody can get exposed to COVID before they go home for the holidays.

Jesse: [laugh]. Well, you at least get one holiday in, if you celebrate or recognize Thanksgiving, and then you get to bring everything back after that.

Amy: Yeah, people bring enough things back from Vegas. I’m not sure we’d have to find more reasons. [laugh].

Tim: [laugh].

Jesse: I know that there’s that great marketing tactic of, “What happens in Vegas stays in Vegas,” but—

Tim: That’s not what they say at the clinic.

Jesse: Nope. Mm-mm. Now, I will say, I know that almost every conference event was completely virtual last year due to the pandemic, and this year, a lot of conferences are still trying to straddle that line between what’s acceptable, can we do maybe smaller events in person, some kind of a hybrid online/in-person thing. I have mixed feelings on this. I appreciate that I can still attend AWS re:Invent from home this year digitally, I can still watch a lot of the main keynote events and a lot of the other information that is being shared, but I don’t know, it’s always hard because if you do a hybrid event, you’re automatically going to miss out on any of that in-person socializing and networking.

Tim: Well. So, I think it’s interesting. AWS re:Invent suffers from the same issue that pretty much all other conferences suffer from: there’s not really value-add in the talks, at least not in attending them live.



Jesse: Yeah.

Amy: If you’re going to be able to see those talks afterwards, if the announcements are going to be publicized afterwards, which is true in both cases, then what’s the point of spending the money, and the time, and the possible exposure to go watch them in person? So, then the other thing is, “Well, we want to go for some of the training seminars,” or some of these other things. Well, those are also offered online, often. Or, like, copies of them online. These are the same kinds of tutorials that you can have your TAM or SA run if you’re an AWS customer currently; that’s what they’re doing there.

The other thing is, too, those in-person sessions get filled up so quickly that there’s no guarantee [unintelligible 00:05:08] anyways. And that’s one of the complaints they’ve had about re:Invent in the past: that you can’t get into any of the sessions. And so, you couple all that along with most of the reason for going being—if it’s not the talks and it’s not the sessions, it’s the hallway track. And then you’ve got to kind of wonder, is the hallway track going to be valuable this year? Because if it’s hybrid, what percent of the people that you would normally talk to are going to be there and what percentage aren’t? And so there’s a lot of calculus that’s got to go into it this year.

Jesse: I’ve always struggled with any vendor-sponsored event: all the talks feel either like a sales pitch, or like a use case that just doesn’t fit for me. And that may just be where I’m at in my professional journey; there’s definitely reasons to go if you want to see some of these talks or see some of this information live, or be the first person to talk about it. Or even for the people who are going to be the news sources for everybody else, who want to be the first to say, “Oh, we attended, and we saw these things and were live-tweeting the entire conference.” If that’s your shtick, I fully support that, but I always struggle going to any kind of vendor conference because the value that I get from the talks, or from training if I go to training, just doesn’t feel like enough for me, personally.

Amy: So, I’ve done some of the AWS-led training when Summit was in Chicago a couple years ago, and I’ll be honest, you lose a lot in these large AWS-led trainings. These classes are not going to be like the ones you would sign up for hosted by your company or by your local user group chapter, where you’ll have at max 100 people. You have well over that. You have an entire conference room full of people, and they’re asking questions that are all across the levels of expertise for that topic. I went to one of the certification training seminars, and straight-up 15 minutes were spent talking about what a region is. And given that’s page one of any training material, that was a waste of $300.

Jesse: Yeah.

Tim: I think you run into the problem because it is, in fact, I mean, let’s be honest, a multi-day sales pitch. It’s not a user conference, it’s not user-generated content. It’s cherry-picked by the powers-that-be at AWS, the service groups. There’s a big push for account executives to encourage high-level or high-spend accounts to participate in those so they get logo recognition. And so that becomes more of the issue than the actual cool user stories.

And that’s fine if you’re using it as literally just a sales conference because it’s very compelling sales material: your account executive will go there and try to close deals, or close bigger deals, or sign EDPs or something like that. But from an engineering standpoint, from a technical standpoint, it’s remarkably uncompelling.



Jesse: Yeah, I think that’s one other thing to call out, which is, there is definitely this networking opportunity that we talked about from a hallway track perspective, but there’s also a networking and business opportunity to meet with your account manager, or your TAM, or your SA in person and have conversations about whatever things you want to talk about; about future architecture, or about closing an EDP—or I should say, about an EDP because the account manager will try to close that EDP with you—and then basically use that as next steps for what you want to do with AWS. But again, all of those things can be done without flying you to Las Vegas and being amongst all these other people.

Tim: I mean, let’s not take away; there’s a certain synergy that happens when you have face-to-face contact with folks, and a lot of these conversations you have in hallways are super, super organic. And so I think that’s indicative of conferences as a whole. One of the things that we learned in the pandemic is that, yeah, you can have talks where people just, like, look at a screen and watch talks, and a lot of conferences have done that. But that’s not why people want to go to the conference; they want to go to the conference to talk to people and see people. And if you want to have a conference where people talk to people and see people, and that’s the whole point of doing it, then the business model behind that looks dramatically different, and the content behind that looks dramatically different.

You just have a bunch of birds-of-a-feather sessions or a bunch of breakout sessions. You do a keynote at the beginning, you do a keynote at the end, and then you just let people mingle, and maybe you have some led topics, but you don’t generate content; you shut up and you let the people innovate.

Jesse: I also want to add to that. It is one thing to have a conference that is in one venue where everybody is going to be gathered in the same space, creating conversation, or creating easy opportunities—

Amy: Five miles worth of content isn’t exciting for you?

Jesse: Yeah. So, in Las Vegas because the entire conference is spread across the entire strip, you’re going to have opportunities to network across the entire strip basically, and sometimes that means you’re going to only spend time networking with the people who are in the same hotel as you at the time of the track that you are waiting for, or the time of the event that you are waiting for. It is unlikely that you are going to run all around the strip just to be able to network with everybody that you run into.

Corey: This episode is sponsored in part by our friends at Lumigo. If you’ve built anything from serverless, you know that if there’s one thing that can be said universally about these applications, it’s that it turns every outage into a murder mystery. Lumigo helps make sense of all of the various functions that wind up tying together to build applications. It offers one-click distributed tracing so you can effortlessly find and fix issues in your serverless and microservices environment. You’ve created more problems for yourself; make one of them go away. To learn more visit lumigo.io.

Amy: The other issue I have, not just with re:Invent but with really any larger conference that relies on the kind of content where a person is speaking at you and you don’t get to meet them, is that without any level of Q&A or interactivity—and this is true especially for AWS-led events—it is no different than watching someone on video. You can go to these talks, and you can perhaps have conversations with people as they filter out of the room, but there’s no way you’re going to be able to talk to the person who was delivering that content, unless you can track them down amongst the sea of people at re:Invent or [unintelligible 00:12:16] in Las Vegas.

Tim: What typically has to happen is that after someone has given a compelling talk and you really want to talk to them, you have to go and talk to your account manager; your account manager will then set up a meeting that will happen at a later time where you’re going to all call in over Chime, and then you will quote-unquote, “Meet” that person virtually. And if that’s the case, you could have just stayed home and watched [laugh] the talk online, and then done the same thing.



Amy: Conferences need more Chime. That’s what [laugh] the problem is.

Jesse: [laugh]. I think my eye just started twitching a little bit as soon as you said that, Amy.

Amy: I’m glad. So, then why would people go? There’s the hallway track, but is that worth the heavy price tag of going to Vegas? A lot of us live in areas where there is either going to be an AWS Summit or there are AWS user groups. What do you get from going to a larger event such as re:Invent, and having that level of communication, that you can’t get from those smaller groups?

Tim: I mean, the importance of networking cannot be overstated. It is extremely important, whether it’s laying groundwork for future deals or for future collaborations. I’ve been at conferences where a hallway track, just folks meeting up in the hallway and having a really organic discussion, turned into a product within three months. So, those kinds of things are important. And, fortunately or unfortunately, they quite often do happen better when people are in-person and have had a chance to talk, maybe even over a couple of drinks or whatever.

So, I mean, people ink deals, they shake hands, they get, you know, a lot of work done when it comes to maintaining and managing relationships, and to some people, that is worth it. But I do think that you have to be very, kind of, eyes-open about going into this. It’s like, you’re not going to go in there to get a lot of technical insight, you’re not going to go in there to talk to a whole bunch of people unless you really have a relationship or establish some kind of rapport with them beforehand. Because just to go up and blindly like, “Hey, I’m going to grab you in the hallway, and this is who I am,” that’s not always great, especially nowadays, when people are, kind of, already averse to, you know, talking to strangers, sometimes.

Jesse: I’ve always struggled with talking to strangers in general at conferences because I’m predominantly introverted, so if I don’t have an open introduction to someone through a mutual third party or mutual friend, it’s just not going to happen. And I’ve gotten better at that over the years as I go to conferences, but it’s going to be especially tough now in cases where folks are not just averse to, I don’t want to say strangers, but averse to physical contact and averse to people just, kind of, approaching them out of the blue. It’s tough. I want to be more mindful of that and I want to be better, but it’s hard, especially in cases where you’re in a crowd of hundreds of people or, you know, thousands of people across the strip; it just gets overwhelming really quickly for some folks.

Amy: I do want to loop back around, if anything, just as a poll for Twitter. Do not close an EDP in Vegas. You’re probably not of the right mind [laugh] and don’t have the right people there to do that. Wait till you get back to work. Please. That’s just me. [laugh].



Jesse: I would also like to add—we talk about why people go; I think that there’s definitely a solid contingent of folks who attend re:Invent because it is the one time a year that the company sanctions them getting away from their family for a couple of days, getting away from, you know, the day-to-day routine of whatever work is going on for a couple days, and go to Vegas. Now, I know that the company is not going to sponsor them drinking every night, or gambling, or whatnot, but they’re likely going to be doing those things anyhow, so it is this company-sanctioned opportunity to just go experience, you know, something different; go take a vacation, basically, for a couple days.

Amy: Corporate Burning Man.

Tim: Corporate Burning Man, exactly. A vacation in Vegas.

Amy: I am not a fan of ever working in Vegas. If I’m on the clock, I cannot be in Vegas, not because I’m prone to excessive behavior when I’m on my own, but more that I cannot be productive in that much noise and on that much flaky internet. It drives me absolutely batty, and as far as implementations go, I’m only going to be so productive in a crowd that large.

Tim: I will say this, especially in regards to Vegas: there are other places you can go, other places that need the money more. If AWS wants to rent a city, rent a city that needs the money. Put that money where it could be used, where it really makes a difference. I don’t know if Vegas is the right place for that, if I’m being honest, especially after all we’ve learned and dealt with in 2020. And so that’s why, in 2021, it’s a no for me; continuing to have re:Invent in Vegas is very, very tone-deaf.

Jesse: I still think, Amy, you and I just need to—actually sorry, all three of us should attend and basically keep a running Waldorf and Statler commentary through the entire conference. I don’t know if we can get that little, you know, opera booth that’s kind of up and away from all the action, but if we can get something like that and do some sports commentary—ohh, maybe on the expo hall—

Amy: That would be great. That would be great if we don’t get banned. [laugh].



Tim: I think what would be even more fun is to give a MST3K—

Jesse: Ohhh.

Tim: —treatment of the keynotes afterwards, you know what I mean?

Jesse: Yeah.

Amy: Yes.



Jesse: I mean, Amy and I had also talked about playing some Dungeons and Dragons while we were there, and I feel like if we can find some, I’m going to say, tech-themed RPG—I realize that is a broad category, and everybody’s going to spam me afterwards for this, but—



Amy: I got that. Don’t worry about it.

Jesse: Yeah, I’m on board. I feel like anything that we can do to create a roleplaying game out of this conference, I’m down.



Tim: I’m still waiting for you to explain to the audience in general who Waldorf and Statler are.

Jesse: Oh, yes, that is fair. Okay. Waldorf and Statler are two characters from the old-school Muppet Show, which is amazing and delightful. It’s on Disney+; I highly recommend it. They are basically—

Amy: They’re two grumpy old muppets, and they have been roasting people since the 70s. That is—that’s all it is. [laugh].

Tim: All they do is they sit up in the upper booth and they throw shade, and I love it.

Amy: Yes. And they just show up in random parts in different movies. They’ll be, like, on a park bench, and there’ll be a serious moment, and then they’ll just start talking crap for no reason. And it’s great.

Jesse: They’re the best. They’re absolutely fantastic. I adore them. I hope to be them one day.

Amy: One day.



Tim: Really, both of them? I don’t, I don’t know how that’s going to work.



Jesse: I am hoping to clone myself. One of me is going to have fabulous hair and one of me is going to be balding. Probably the clone is going to be balding; sorry about it, future me. But—

Amy: [laugh].

Tim: Well, I mean, and have just a magnificent chin, right?

Jesse: Yes, yes, that’s the trade-off. Losing the hair up top but absolutely fantastic chin.

Tim: Here’s what I want to see. I want to see the listeners submit things that you think should be on the re:Invent bingo cards.

Amy: Ohh, yes.

Jesse: Yes.

Amy: I would love to see that.

Jesse: So, for those of you listening, you’ve got two options for submitting things that you’d like to be on the re:Invent bingo cards. The ideal option is to go to lastweekinaws.com/QA. Fill out the form and let us know what you think should be on the bingo card. You can also respond to the social media post that will be posted for this content, and we can take a look at that as well. But that’ll be a little bit harder for us to follow because I’m unfortunately not like Corey. I can’t absorb all of Twitter in a day; it takes me a longer time to read all that content.



Jesse: If you’ve enjoyed this podcast, please go to lastweekinaws.com/review and give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review. Give it a five-star rating on your podcast platform of choice and tell us what you think about AWS re:Invent.

Announcer: This has been a HumblePod production. Stay humble.

Fri, 25 Jun 2021 03:00:00 -0700
The Cloud Genie

Want to give your ears a break and read this as an article? You’re looking for this link. https://www.lastweekinaws.com/blog/the-cloud-genie



Never miss an episode



Help the show



What's Corey up to?

Wed, 23 Jun 2021 03:00:00 -0700
Consistently Crashing EC2 Instances
AWS Morning Brief for the week of June 21, 2021 with Corey Quinn.
Mon, 21 Jun 2021 03:00:00 -0700
Listener Questions 6

Transcript

Corey: This episode is sponsored in part by LaunchDarkly. Take a look at what it takes to get your code into production. I’m going to just guess that it’s awful because it’s always awful. No one loves their deployment process. What if launching new features didn’t require you to do a full-on code and possibly infrastructure deploy? What if you could test on a small subset of users and then roll it back immediately if results aren’t what you expect? LaunchDarkly does exactly this. To learn more, visit launchdarkly.com and tell them Corey sent you, and watch for the wince.

Jesse: Hello, and welcome to AWS Morning Brief: Fridays From the Field. I’m Jesse DeRose.

Amy: I’m Amy Negrette.

Tim: And I’m Tim Banks.

Jesse: This is the podcast within a podcast where we talk about all the ways we’ve seen AWS used and abused in the wild, with a healthy dose of complaining about AWS for good measure. Today is a very special episode for two reasons. First, we’re going to be talking about all the things that you want to talk about. That’s right, it’s time for another Q&A session. Get hyped.



Amy: And second, as is Duckbill’s customary hazing ritual, we’re putting new Duckbill Group Cloud Economist Tim Banks through the wringer to answer some of your pressing questions about cloud costs and AWS. And he has pretty much the best hobbies.

Tim: [laugh].

Jesse: Absolutely.

Tim: You know, I choke people for fun.

Jesse: [laugh]. I don’t even know where to begin with that. I—you know—

Amy: It’s the best LinkedIn bio, that’s [laugh] where you begin with that.

Tim: Yeah, I will change it right after this, I promise. But no, I think it’s funny, we were talking about Jiu-Jitsu as a hobby, but my other hobby is I like to cook a lot, and I’m an avid, avid chili purist. And we were in a meeting earlier and Amy mentioned something about a bowl of sweet chili. And, dear listeners, let me tell you, I was aghast.

Amy: It’s more of a sweet stewed meat than it is, like, some kind of, like, meat candy. It is not a meat candy. Filipinos make very sweet stews because we cannot handle chili, and honestly, we shouldn’t be able to handle anything that’s caramelized or has sugar in it, but we try to anyway. [laugh].

Tim: This sounds interesting, but I don’t know that I would categorize it as chili, especially if it has beans in it.

Jesse: It has beans. We put beans in everything.

Tim: Oh, then it can’t be chili.

Jesse: Are you a purist that your chili cannot have beans in it?

Tim: Well, no. Chili doesn’t have beans in it.

Amy: Filipino food has beans in it. Our desserts have beans in it. [laugh].

Jesse: We are going to pivot, we’re going to hard pivot this episode to just talk about the basis of what a chili recipe consists of. Sorry, listeners, no cost discussions today.

Tim: Well, I mean, it’s a short list: a chili contains meat and it contains heat.

Jesse: [laugh].

Tim: That’s it. No tomatoes, no beans, no corn, or spaghetti, or whatever people put in it.

Amy: Okay, obviously the solution is that we do some kind of cook-off where Tim and Pete cook for everybody, and we pull in Pete as a special quote-unquote, outside consultant, and I just eat a lot of food, and I’m cool with that. [laugh].

Jesse: I agree to this.

Tim: Pete is afraid of me, so I’m pretty sure he’s going to pick my chili.

Jesse: [laugh].

Amy: I could see him doing that. But also, I just like eating food.

Tim: No, no, it’s great. We should definitely do a chili cook-off. But yeah, I am willing to entertain any questions about, you know, chili, and I’m willing to defend my stance with facts and the truth. So…

Amy: If you have some meat—or [sheet 00:03:19]—related questions, please get into our DMs on Twitter.

Jesse: [laugh]. All right. Well, thank you to everyone who submitted their listener questions. We’ve picked a few that we would like to talk about here today. I will kick us off with the first question.

This first question says, “Long-time listener, first-time caller. As a solo developer, I’m really interested in using some of AWS’s services. Recently, I came across AWS’s Copilot, and it looks like a potentially great solution for deployment of a basic architecture for a SaaS-type product that I’m developing. I’m concerned that messing around with Copilot might lead to an accidental large bill that I can’t afford as a solo dev. So, I was wondering, do you have a particular [bizing 00:04:04] availability approach when dealing with a new AWS service, ideally, specific steps or places to start with tracking billing? And then specifically for Copilot, how could I set it up so it can trip off billing alarms if my setup goes over a certain threshold? Is there a way to keep track of cost from the beginning?”

Tim: AWS has some basic billing alerts in there. They are always going to be kind of reactive.

Jesse: Yes.

Amy: They can detect some trends, but as a solo developer, what you’re going to get is a notification that the previous day’s spending was pretty high. And then you’ll be able to trend it out that way. As far as asking if there’s a proactive way to predict what the cost of your particular architecture is going to be, the easy answer is going to be no. Not one that isn’t going to be cost-prohibitive for a solo developer to purchase.
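
For a rough sketch of those reactive alerts: this is approximately what wiring up an AWS Budgets email notification looks like with boto3. The account ID, dollar limit, threshold, and address are all placeholders, and the alert still fires after the spend happens, not before.

    import boto3

    budgets = boto3.client("budgets")

    budgets.create_budget(
        AccountId="123456789012",  # placeholder account ID
        Budget={
            "BudgetName": "solo-dev-monthly",
            "BudgetLimit": {"Amount": "25", "Unit": "USD"},
            "TimeUnit": "MONTHLY",
            "BudgetType": "COST",
        },
        NotificationsWithSubscribers=[
            {
                # Fire when actual (not forecasted) spend passes 80% of the limit.
                "Notification": {
                    "NotificationType": "ACTUAL",
                    "ComparisonOperator": "GREATER_THAN",
                    "Threshold": 80.0,
                    "ThresholdType": "PERCENTAGE",
                },
                "Subscribers": [
                    {"SubscriptionType": "EMAIL", "Address": "dev@example.com"}
                ],
            }
        ],
    )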

Jesse: Yeah, I definitely recommend setting up those reactive billing alerts. They’re not going to solve all of your use cases here, but they’re definitely better than nothing. And the one that I definitely am thinking of that I would recommend turning on is the Cost Explorer Cost Anomaly Detector because that actually looks at your spend based on a specific service, a specific AWS cost category, a specific user-defined cost allocation tag. And it’ll tell you if there is a spike in spend. Now, if your spend is just continuing to grow steadily, Cost Anomaly Detector isn’t going to give you all the information you want.

It’s only going to look for those anomalous spikes where all of a sudden, you turned something on that you meant to turn off, and left it on. But it’s still something that’s going to start giving you some feedback and information over time that may help you keep an eye on your billing usage and your spend.
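
And a sketch of turning on Cost Anomaly Detection itself, again with placeholder names, threshold, and email; the per-service monitor here is just one of the monitor types you could choose.

    import boto3

    ce = boto3.client("ce")

    # Watch each AWS service's spend for anomalous spikes.
    monitor = ce.create_anomaly_monitor(
        AnomalyMonitor={
            "MonitorName": "per-service-spikes",
            "MonitorType": "DIMENSIONAL",
            "MonitorDimension": "SERVICE",
        }
    )

    # Email a daily digest of anomalies whose estimated impact tops $20.
    ce.create_anomaly_subscription(
        AnomalySubscription={
            "SubscriptionName": "daily-anomaly-digest",
            "MonitorArnList": [monitor["MonitorArn"]],
            "Subscribers": [{"Type": "EMAIL", "Address": "dev@example.com"}],
            "Threshold": 20.0,
            "Frequency": "DAILY",
        }
    )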

Amy: Another thing we highly recommend is to have a thorough tagging strategy, especially if you’re using a service to deploy resources. Because you want to make sure that all of your resources, you know what they do and you know who they get charged to. And Copilot does allow you to do resource tagging within it, and then from there should be able to convert them to cost allocation tags so you can see them in your console.
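
One wrinkle worth flagging: resource tags don’t show up in Cost Explorer until they’re activated as cost allocation tags, and they only accrue cost data from activation onward, not retroactively. Activation can be done in the Billing console; something like this sketch should also work where the Cost Explorer API supports it (the tag keys are placeholders).

    import boto3

    ce = boto3.client("ce")

    # Activate resource tags as cost allocation tags so they appear in
    # Cost Explorer and the Cost and Usage Report.
    ce.update_cost_allocation_tags_status(
        CostAllocationTagsStatus=[
            {"TagKey": "team", "Status": "Active"},
            {"TagKey": "environment", "Status": "Active"},
        ]
    )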

Jesse: Awesome. Well, our next question is from Rob. Rob asks, “How do I stay HIPAA compliant but keep my costs down? Do I really need VPC Flow Logs on? Could we talk in general about the security options in AWS and their cost impact? My security team wants everything on, but it would cost us ten times our actual AWS bill.”

Rob, we have actually seen this from a number of clients. It is a tough conversation to have because the person in charge of the bill wants to make sure that spend is down, but security may need certain security measures in place, product may need certain measures in place for service level agreements or service level objectives, and there’s absolutely a need to find that balance between cost optimization and all of these compliance needs.

Tim: Yeah, I think it’s also really important to thoroughly understand what the compliance requirements are. I’m fairly certain that for HIPAA, you may not have to have VPC Flow Logs specifically enabled. The language is something like ‘logging of visitors to the site’ or something like that. So, you need to be very clear and concise about what you actually need, and remember, for compliance, typically it’s just a box check. It’s not going to be a how-much or a what-percent; it’s going to be, “Do you have this or do you not?”

And so if the HIPAA compliance changes where you absolutely have to have VPC Flow Logging turned on, then there’s not going to be a way around that in order to maintain your compliance. But if the language is not specifically requiring that, then you don’t have to, and that’s going to become something you have to square with your security team. There are ways to do those kinds of logging on other things depending on what your application stack looks like, but that’s definitely a conversation you’re going to want to have, either with your security team, with your product architects, or maybe even outside or third-party consultant.



Jesse: Another thing to think about here is, how much is each of these features in AWS costing you? How much are these security regulations, the SLA architecture choices, how much are each of those things costing you in AWS? Because that is ultimately part of the conversation, too. You can go back to security, or product, or whoever and say, “I understand that this is a business requirement. This is how much it’s costing the business.”

And that doesn’t mean that they have to change it, but that is now additional information that everybody has to collaboratively decide, “Okay, is it worthwhile for us to have this restriction, have this compliance component at this cost?” And again, as Tim was mentioning, if it is something that needs to be set up for compliance purposes, for audit purposes, then there’s not really a lot you can do. It’s kind of a, I don’t want to say sunk cost, but it is a cost that you need to understand that is required for that feature. But if it’s not something that is required for audit purposes, if it’s not something that just needs to be, like, a checkbox, maybe there’s an opportunity here if the cost is so high that you can change the feature in a way that brings the cost down a little bit but still gives security, or product, or whoever else the reassurances that they need.
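
To make that concrete for the VPC Flow Logs example: if the box does have to stay checked, delivery choices still move the cost. This sketch sends flow logs to S3 (typically cheaper than CloudWatch Logs) with a trimmed custom format; the VPC ID and bucket are placeholders, and capturing only rejected traffic is strictly an assumption to check against your actual compliance language.

    import boto3

    ec2 = boto3.client("ec2")

    ec2.create_flow_logs(
        ResourceType="VPC",
        ResourceIds=["vpc-0123456789abcdef0"],  # placeholder VPC ID
        TrafficType="REJECT",  # or "ALL" if the requirement demands it
        LogDestinationType="s3",
        LogDestination="arn:aws:s3:::example-flow-log-bucket/vpc/",
        # A trimmed format cuts log volume; drop fields only if compliance allows.
        LogFormat="${srcaddr} ${dstaddr} ${dstport} ${action} ${start}",
    )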

Tim: I think the other very important thing to remember is that you are not required to run your application in AWS.

Jesse: Yeah.

Tim: You can run it on-premise, you can run at a different cloud provider. If it’s going to be cost-prohibitive to run at AWS and you can’t get the cost down to a manageable level, through, kind of, normal cost reduction methods of EDPs, or your pricing agreement, remember you can always put that on bare metal somewhere and then you will be able to have the logging for free. Now, mind you, you’re going to have to spend money elsewhere to get that done, but you’re going to have to look and see what the overall cost is going to be. It may, in fact, be much less expensive to host that on metal, or at a different provider than it would be at AWS.

Corey: This episode is sponsored by ExtraHop. ExtraHop provides threat detection and response for the Enterprise (not the starship). On-prem security doesn’t translate well to cloud or multi-cloud environments, and that’s not even counting IoT. ExtraHop automatically discovers everything inside the perimeter, including your cloud workloads and IoT devices, detects these threats up to 35 percent faster, and helps you act immediately. Ask for a free trial of detection and response for AWS today at extrahop.com/trial.

Jesse: Our next question is from Trevor Shaffer. He says, “Loving these Friday from the field episodes and the costing”—thank you—“I’m in that world right now, so all of this hits home for me. One topic not covered with the cost categorization, which I’m tasked with, is how to separate base costs versus usage costs. Case in point, we’re driving towards cost metrics based on users and prices go up as users go up. All of that makes sense, but there’s always that base load required to serve quote-unquote, ‘no users.’

“The ALB instance hours versus the LCU hours, the minimum number of EC2 instances for high availability, things like that. Currently, you can’t tag those differently, so I think I’m just doomed here and my hopes will be dashed. For us, our base costs are about 25% of our bill. Looking for tricks on how to do this one well. You can get close with a lot of scripting and time, teasing out each item manually.” Trevor, you can, and I also think that is definitely going to be a pain point if you start scripting some of these things. That sounds like a lot of effort that may give you some useful information, but I don’t know if it’s going to give you all of the information that you want.

Tim: Well, it’s also a lot of effort, and it’s also room for error. It won’t take but a simple error in anything that you write for these costs to be calculated incorrectly. So, that’s something to consider as well: is it worth the overall cost of engineering time, and maintenance, and everything like that, to write these scripts? These are decisions that engineering groups have to make all the time. That said, I do think this is, for me, one of the larger problems that you see with AWS billing: it is difficult to differentiate something that should be reasonably easy to differentiate.

If I get my cell phone bill, I know exactly how much it’s going to cost to have the line, and then I can see exactly how much it’s going to cost me for the minutes. The usage cost is very easily separated from—I’m sorry, the base cost is very easily separated from the usage cost. It’s not always that way with AWS, and I do think that’s something that they could fix.

Jesse: Yeah, one thing that I’ve been thinking of is, I don’t want to just recommend turning things on and measuring, but I’m thinking about this from the same perspective that you would think about getting a baseline for any kind of monitoring service: as you turn on a metric or as you start introducing a new metric before you start building alerts for that metric, you need to let that metric run for a certain amount of time to see what the baseline number, usage amount, whatever, looks like before you can start setting alerts. I’m thinking about that same thing here. I know that’s a tougher thing to do when this is actually cost involved when it’s actually costing you money to leave something on and just watch what usage looks like over time, but that is something that will give you the closest idea of what base costs look like. And one of the things to think about, again, is if the base costs are unwieldy for you or not worthwhile for you in terms of the way the architecture is built, is there either a different way that you can build the architecture that is maybe more ephemeral that will make it cost less when there are no users active? Is there a different cloud provider that you can deploy these resources to that is going to ultimately cost you less when you have no users active?

Tim: I think too, though, that when you have these discussions with engineering teams and they’re looking at what their priorities are going to be and what the engineering cost is going to be, oftentimes, they’re going to want metrics on how much is this costing us—how much would it cost otherwise? What is our base cost, what’s our usage cost?—so that you can make a case and justify it with numbers. So, you may think that it is better to run this somewhere else or to re-architect your infrastructure around this, but you’re going to have to have some data to back it up. And if this is what you need to gather that data, then yeah, it is definitely a pain point.
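
For a sense of what that data gathering might look like, and of exactly the error-prone scripting Tim is warning about, here’s a sketch that pulls a month of spend grouped by usage type and buckets anything matching a hand-maintained list as base load. The classification list is entirely a hypothetical starting point; every workload needs its own, which is where the errors creep in.

    import boto3

    ce = boto3.client("ce")

    # Usage-type substrings we *assume* represent always-on base load.
    BASE_HINTS = ("LoadBalancerUsage", "NatGateway-Hours", "BoxUsage")

    resp = ce.get_cost_and_usage(
        TimePeriod={"Start": "2021-05-01", "End": "2021-06-01"},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
    )

    base = usage = 0.0
    for group in resp["ResultsByTime"][0]["Groups"]:
        usage_type = group["Keys"][0]
        amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
        if any(hint in usage_type for hint in BASE_HINTS):
            base += amount
        else:
            usage += amount

    print(f"base ~${base:,.2f}  usage ~${usage:,.2f}")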

Amy: I agree. I think this is one of those cases where—and I am also loath to just leave things on for the sake of it, but especially as you onboard new architectures and new applications, this should be done at that stage when you start standing things up and finalizing that architecture. Once you know the kind of architecture you want and you’re pushing things to production, find out what that baseline is, have it be part of that process, and have it be a cost of that process. And finally, “As someone new to AWS and wanting to become a software DevOps insert-buzzword-here engineer”—I’m a buzzword engineer—“We’ve been creating projects in Amplify, Elastic Beanstalk, and other services. I keep the good ones alive and have done a pretty good job of killing things off when I don’t need it. What are your thoughts on free managed services in general when it comes to cost transparencies with less than five months left on my free year? Is it a bad idea to use them as someone who is just job hunting? I’m willing to spend a little per month, but don’t want to be here with a giant bill.”

So, chances are, if you’re learning a new technology or a new service and you’ve been pretty diligent about turning your services off, then unless you run into that pitfall where you get a big surprise bill, your bill is not going to rise that much higher. That said, there have been a lot of instances popping up, on Twitter especially, where people are getting very large bills. If you’re not using services and you’re not actively learning on them, I would just turn them off so you don’t forget later. We’ve also talked about this in our build-versus-buy discussion, where that is the good thing about having a managed service: if you don’t need it anymore and you’re not learning or using it, you can just turn it off. And if you have less than half a year left on your first free year, there are plenty of services that have a free tier or a really cheap tier at the start, so if you want to go back and learn on them later, you still could.

Tim: I think too, Amy, it’s also important to reflect, at least for this person, that if they’re in an environment where they’re trying to learn something, and maintaining infrastructure is not the main core of what they’re trying to learn, then I wouldn’t do it. The reason that they have these managed services is to allow engineering teams to be more focused on the things that they want to do as far as development, versus the things they have to do around infrastructure management. If you don’t have an operations team or an infrastructure team, then maintaining the infrastructure on your own can become unwieldy to the point that you’re not really even learning the thing you wanted to learn; now you’re learning how to manage Elasticsearch.



Amy: Yeah.

Jesse: Absolutely. I think that’s one of the most critical things to think about here. These managed services give you the opportunity to use all these services without managing the infrastructure overhead. And to me, there may be a little bit extra costs involved for that, but to me that cost is worth the freedom to not worry about managing the infrastructure, to be able to just spin up a cluster of something and play with it. And then when you’re done, obviously, make sure you turn it off, but you don’t have to worry about the infrastructure unless you’re specifically going to be looking for work where you do need to manage that infrastructure, and that’s a separate question entirely.

Amy: Yeah. I’m not an infrastructure engineer, so anytime I’m not using infrastructure, and I’m not using a service, I just—I make sure everything’s turned off. Deleting stacks is very cathartic for me, just letting everything—just watching it all float away into the sunset does a lot for me, just knowing that it’s not one more thing I’m going to have to watch over because it’s not a thing I like doing or want to do. So yeah, if that’s not what you want to do, then don’t leave them on and just clean up after yourself, I suppose. [laugh].

Tim: I’ll even say that even if you’re an infrastructure engineer, which is my background, you can test your automation of building all this, you know, building a cluster, deploying things like that, and then tear it down and get rid of it. You don’t have to leave it up forever. If you’re load testing an application, that’s a whole different thing, but that’s probably not what you’re doing if you’re concerned about the free tier costs. So yeah, if you’re learning Terraform, you can absolutely deploy a cluster or something and just tear it back out as soon as you’re done. If you’re learning how to manage whatever it is, build it, test it, make sure it runs, and then tear it back down.

Jesse: All righty, folks, that’s going to do it for us this week. If you’ve got questions you would like us to answer, please go to lastweekinaws.com/QA, fill out the form and we’d be happy to answer those on a future episode of the show. If you’ve enjoyed this podcast, please go to lastweekinaws.com/review and give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review, give it a five-star rating on your podcast platform of choice and tell us whether you prefer sweet chili or spicy chili.

Announcer: This has been a HumblePod production. Stay humble.

Fri, 18 Jun 2021 03:00:00 -0700
The Trillion-Dollar Paradoxical Arguments of a16z

Want to give your ears a break and read this as an article? You’re looking for this link. https://www.lastweekinaws.com/blog/Trillion-Dollar-Cloud



Never miss an episode



Help the show



What's Corey up to?

Wed, 16 Jun 2021 03:00:00 -0700
Kinesis Data Increased-Ambient-Temperature Hose
AWS Morning Brief for the week of June 14, 2021 with Corey Quinn.
Mon, 14 Jun 2021 03:00:00 -0700
Cloud Cost Management Team Starter Kit

Transcript


Corey: This episode is sponsored in part by LaunchDarkly. Take a look at what it takes to get your code into production. I’m going to just guess that it’s awful because it’s always awful. No one loves their deployment process. What if launching new features didn’t require you to do a full-on code and possibly infrastructure deploy? What if you could test on a small subset of users and then roll it back immediately if results aren’t what you expect? LaunchDarkly does exactly this. To learn more, visit launchdarkly.com and tell them Corey sent you, and watch for the wince.

Jesse: Hello, and welcome to AWS Morning Brief: Fridays From the Field. I’m Jesse DeRose.

Amy: I’m Amy Negrette.

Jesse: This is the podcast within a podcast where we talk about all the ways that we have seen AWS used and abused in the wild, with a healthy dose of complaining about AWS for good measure. I feel like it’s just kind of always necessary. There always has to be just that little bit of something extra; it’s the spice that really makes the dish. Today we’re going to be talking about the ‘Cloud Cost Management Team Starter Kit.’ Now, in a previous episode, we talked about the ‘Cloud Cost Management Starter Kit,’ which was a little bit more generalized, and one of the things that we talked about, ultimately, was building a team that is responsible for some of this work, some of this cloud cost management work.

So today, we’re going to take that one step further; we’re going to talk about all of the things that your cloud cost management team should ultimately be responsible for, what it should look like, how you might want to start building that team within your organization. So, I’m going to kick us off. I think one of the first things that is so, so critical for any team that is going to be doing any work is buy-in at the executive leadership level. You need to make sure that engineering leadership, the C-suite leadership has your back in everything that you’re doing. You need to make sure that the work that you’re doing has been signed off at the highest level so that that leadership can help empower you to do your work.

Amy: And we’ve referenced this before: really, every time we talk about what makes a successful project, as the one executing that project, you need the authority and actionable goals in order to do it, and leadership is going to be the ones to lay those out for you.

Jesse: Absolutely. If you don’t have the backing of leadership, whether it is your boss, whether it is the C-suite, whether it’s a VP suite, you’re not going to get other people to listen to what you have to say; you’re not going to get other people to, broadly speaking, generally speaking, care about the work that you’re trying to do, the work that you’re trying to incentivize and empower other people in the organization to do.

Amy: And that kind of leads us into the next portion of it, where you need to know what the responsibilities are and have that clear delineation, so that you understand the things that are expected of you, what the engineering teams are expected to do, and the product teams, and the finance teams. Everyone has to have a pretty much fenced-in idea of what they’re allowed to do and what they are expected to deliver, just like in any project.

Jesse: Absolutely. It’s so critical for me to understand what I’m responsible for, you to understand what you’re responsible for. I can’t tell you how many times I’ve been in a meeting where somebody will say something generally like, “We should do X,” and then everyone nods and goes, “Oh, yeah, yeah, yeah. We should do X.” And then everybody leaves the meeting and thinks that somebody else is responsible for it, and nobody’s been clearly assigned that work, or nobody knows that work is ultimately their responsibility.

Amy: And if you don’t assign it, people are going to assume that this is going to be a thing that if they have time to, they’ll get to it. And we harp on it enough that whenever work is not prioritized, it is automatically deprioritized. That’s just the way task lists shake out, especially at the end of sprint meetings.

Jesse: Absolutely. And I think that’s one of the other things that’s so important, too: it’s not just about assigning the work, but it’s about making sure that everybody who is involved in the conversation, everybody who’s involved in the work, agrees on what those boundaries are, agrees on who is responsible for what actions, more specifically speaking from a task-responsibility perspective. Because at the end of the day, I want my team, whether that is my individual team or a cross-functional team, to all be bought into who’s responsible for what parts of the project. We all need to be on the same page in terms of, “Yes, this is my responsibility. This part of the work is my responsibility. I will take ownership over this,” so that we can all help each other get to that project goal together.

One of the other big ideas that is so critical to starting a cloud cost management team is identifying and socializing your business KPI metrics. Now, this is something that some engineering teams already think about day-to-day. They might have ideas of service-level agreement metrics, maybe service-level objective metrics, but there might be other business metrics that indirectly—or directly—relate to engineering work. It could be the number of users using your SaaS platform, it could be the number of API requests, it could be the amount of storage that customers are storing on your platform. You want to identify what these metrics are, and start measuring your cloud spend against these metrics.

Amy: And as far as cost optimization projects go, the KPIs may not line up directly against how many servers you’re standing up, or how many users are coming through. They’ll be very indicative because you are spending money per user and per resource, but perhaps your business goals are different. Maybe you’re not looking at trying to save money, but better understand where that money is going.

Jesse: Absolutely. It’s not just about how many instances are running per hour, it’s not just about how many servers are running per hour, or how many users per server. It’s really about understanding what are the core driving indicators of your business? What are the things that ultimately influence and impact how your workloads, and servers, and API functions, and everything, flow and grow and change over time?

Amy: These metrics can also be influenced by things that are not architecturally specific, like savings plans, or the savings you would get through reservations, or some other contractual deal you get from your provider.

Jesse: Yeah, that’s one of the hard things, too, that we always hear from our clients. There is this idea that they think that they are spending a certain amount of money because they’re getting discounts from savings plans, or from reserved instances or from an enterprise discount program, and maybe their usage is a lot higher than that, but because they get these discounts, they think that they’re actually using a lot less than they actually are. And while this is not something we’re talking about specifically or directly in this conversation, it is something to be mindful of because there definitely can be a difference between your usage and your overall spend if your company is investing in things like savings plans, and reserved instances, and discounts through either a private pricing addendum or an enterprise discount program.

Amy: Yeah. Really, the bottom line with that is you want to be aware of what your business’s goals are—and this goes back into buy-in, this goes back to leadership—that having a fully contextualized understanding of what it is that they want to do will help you make the right decisions and define your metrics in a way that basically helps you try to set your goals.
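
The arithmetic behind a KPI like cost per user is deliberately simple; the point is pairing the bill with a business number and watching the trend. A sketch, where the user count is a placeholder for whatever your platform actually reports:

    import boto3

    ce = boto3.client("ce")

    resp = ce.get_cost_and_usage(
        TimePeriod={"Start": "2021-05-01", "End": "2021-06-01"},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
    )
    month_spend = float(resp["ResultsByTime"][0]["Total"]["UnblendedCost"]["Amount"])

    monthly_active_users = 12_500  # placeholder: pull from your own analytics

    # The KPI: what each active user costs to serve this month.
    print(f"cost per active user: ${month_spend / monthly_active_users:.4f}")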

Jesse: Absolutely. And all of this comes together in policies and best practices. All of this can come together in a way where you, your team, your cloud cost management team can put all of these ideas and all these things that everybody is agreeing to, into writing. Make sure that everybody is bought in and then write it down; make it an artifact and say, “Okay, after this meeting, we’ve agreed that the way that we are going to handle cross-availability zone traffic is like this,” or, “The way that we are going to handle scaling is like that,” or, “The best practices that we want for storage is this.” Put all this down in writing.

Make sure that there are best practices being created. There’s a number of clients that I’ve worked with before that have seen multiple different teams using the same service but using it in different ways. And maybe one of them has encryption and compression enabled and they’ve got this really tight turnaround time for their services, and another team doesn’t. They’re using a lot more data transfer because they aren’t focusing on compression, for example. And this is an opportunity for a best practice to get everybody on the same page and say, “Okay. If you’re going to use this particular type of service, you need to have compression enabled, you need to architect your services to focus on talking to other services in the same region, in the same availability zone to ultimately try to cut down on data transfer costs, or on storage costs, or other things.”
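
As a toy illustration of the compression point: the difference is often one call at each end, and whether it pays off depends on your payloads, so measure rather than assume.

    import gzip
    import json

    payload = json.dumps(
        {"events": [{"id": i, "value": "x" * 40} for i in range(1000)]}
    )

    # Compressing before sending shrinks cross-AZ or cross-region transfer,
    # which is billed per gigabyte; the CPU cost is usually trivial next to it.
    compressed = gzip.compress(payload.encode("utf-8"))
    print(f"raw: {len(payload):,} bytes, gzipped: {len(compressed):,} bytes")

    # Receiving side: round-trips intact.
    original = json.loads(gzip.decompress(compressed).decode("utf-8"))
    assert original["events"][0]["id"] == 0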

Corey: This episode is sponsored in part by our friends at Lumigo. If you’ve built anything from serverless, you know that if there’s one thing that can be said universally about these applications, it’s that it turns every outage into a murder mystery. Lumigo helps make sense of all of the various functions that wind up tying together to build applications. It offers one-click distributed tracing so you can effortlessly find and fix issues in your serverless and microservices environment. You’ve created more problems for yourself; make one of them go away. To learn more, visit lumigo.io.

Amy: That brings up a really good point that I noticed when I was actually coding day-to-day: each project and each team is ultimately different because you’re building different things and you’re building them with different people. So, it’s entirely possible that your KPIs may be different between teams.

Jesse: Yeah.

Amy: But you’re not going to know that without all the other stuff that we mentioned. And it’s perfectly fine if your KPIs are different between teams, or if your practices have to be modified to better work with what your goals are. Also, best practices, just like everything else in the cloud, can change. If the cloud architecture backend can change once every five minutes, you have to be able to be flexible and say, “These cost management rulings that we made two quarters ago, two years ago, they made sense then. And just like anything else, things evolved, we scaled, and our needs have changed, so we have to review these.”

This is why, as we mentioned before, maybe review it as part of your cloud cost analysis once a quarter, because things change all the time. AWS changes all the time. Not that that’s a thing I’ve complained about tons of times, in different places, with a lot of recorded evidence.



Jesse: Amy, why can I see this giant—your eye is just twitching, just this giant throbbing vein on your forehead right now?



Amy: [laugh]. I have to start recording these podcasts with some kind of blood pressure monitor, and we can see, as soon as I say the word, “AWS starts changing stuff,” and just watch that skyrocket.

Jesse: That is the quote. When our audio engineers post this, that is the quote to use to highlight this episode on social media.

Amy: [laugh]. Yes, absolutely.

Jesse: And I think to really bring this back around, all of these ideas, all of these things that we’re talking about aren’t just about saying we’re going to do the one thing and we’re going to do it that way for all of eternity, like Amy said. Things change over time, and that’s fine. That’s perfectly normal. So, your best practices should change over time, too. Maybe one of the things that you write down as part of your best practices is that you’re going to review your best practices, maybe once a quarter, once every six months, every year, maybe once every, whatever time period works best for your team, the way that your workloads work, the way that your team works, the pace at which your team works, make sure that you’re actively reviewing this information because all of us have seen documentation that is written once and is immediately out of date, and nobody ever touches it ever again. And that’s not what these best practices are about.

Amy: If you want to make sure that teams are bought in, show that you care, that you are aware that this stuff that they work on evolves and changes with them. Cloud cost management policies are hard enough to explain, much less get buy-in for. You want teams to know that you’re doing this with an awareness of what they are doing and why they are doing it. You don’t want to go in and say, “We’re making widespread changes. We do not care what happens to your infrastructure because of it.” You want to go, “Because you are running things this way, it has to look like this, because the cost is going to look like this.”

Jesse: Absolutely. Having data to back up your decisions really, really helps in every decision you’re making. It shows that you’re making data-driven decisions; there is a why behind what you’re doing. And it helps other people understand what you’re doing as well.

Well, if you’ve got a question you’d like us to answer on air, please go to lastweekinaws.com/QA. If you’ve enjoyed this podcast, please go to lastweekinaws.com/review and give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review. Give it a five-star rating on your podcast platform of choice and tell us what your ideal starter kit would include.

Announcer: This has been a HumblePod production. Stay humble.

Fri, 11 Jun 2021 03:00:00 -0700
The Key to Unlock the AWS Billing Puzzle is Architecture

Want to give your ears a break and read this as an article? You’re looking for this link. https://www.lastweekinaws.com/blog/the-key-to-unlock-the-aws-billing-puzzle-is-architecture



Never miss an episode



Help the show



What's Corey up to?

Wed, 09 Jun 2021 03:00:00 -0700
State Money Printing Machine
AWS Morning Brief for the week of June 7, 2021 with Corey Quinn.
Mon, 07 Jun 2021 03:00:00 -0700
Balancing Cost Optimizations and Feature Work

Links:


Transcript


Corey: This episode is sponsored in part by LaunchDarkly. Take a look at what it takes to get your code into production. I’m going to just guess that it’s awful because it’s always awful. No one loves their deployment process. What if launching new features didn’t require you to do a full-on code and possibly infrastructure deploy? What if you could test on a small subset of users and then roll it back immediately if results aren’t what you expect? LaunchDarkly does exactly this. To learn more, visit launchdarkly.com and tell them Corey sent you, and watch for the wince.

Jesse: Hello, and welcome to the AWS Morning Brief: Fridays From the Field. I’m Jesse DeRose.

Amy: I’m Amy Negrette.

Jesse: This is the podcast within the podcast where we like to talk about all the ways we’ve seen AWS used and abused in the wild, with a healthy dose of complaining about AWS for good measure. Today, we’re going to be talking about balancing cost optimization work against feature work.

Amy: Buckle up everyone. I’ve got a lot of thoughts about this. Just kidding. It’s just the one: don’t.



Jesse: You heard it here first, folks. Don’t. Amy Negrette just says, “Don’t.”

Amy: Don’t. [laugh].

Jesse: So Amy, does that mean, don’t balance the work?

Amy: More like, don’t choose. It’s always hard to make the argument to take an engineer off of feature work. This goes for all sorts of support tasks, like updates and documentation, and as a group, we figured out that trying to put those off until an engineer has time to do them is not going to work: that work never gets prioritized, it eventually gets deprioritized, and no one looks at it. And that’s why DocOps is a thing. It’s a process that now gets handled as part of, and in parallel with, software development.

Jesse: Yeah, I’ve had so many conversations in previous companies that I’ve worked for, where they basically said, “Well, we don’t have time to write documentation.” Or they will say, “The code is the documentation.” And, to their credit, there are a lot of places where the code is very cleanly documented, but if somebody is coming into this information for the first time and they don’t have technical knowledge or they don’t have deep expertise in what you’re looking at, they need documentation that is clear, understandable, and approachable. And it is so difficult to find that balance to actually make sure that that work is part of everything that you do.

Amy: And I think what the industry has decided is to make it a requirement for pull requests: if you’re going to make a change, you have to document that change somewhere, and if the change has any kind of user impact, the documentation gets displayed alongside it. That’s the only way to make it a priority with software. And cost optimization has to be treated in a similar respect.

Jesse: Yeah, so let’s talk about cost optimization as a process. To start, let’s talk about when to do it. Is this something that we do a little bit all the time, or do we do it after everything’s already done?

Amy: I know I just cited DocOps as a good model for this, even though that’s literally what we cannot do. We can’t treat cost optimization as something we do a little bit along the way because, again, speaking as an engineer, if I’m allowed to over-optimize or over-engineer something, I’m going to take that opportunity to do that.



Jesse: Absolutely.



Amy: And if we’re going to do project-wide cost optimization, we need to know what usage patterns are; we need full user and business context on how the system is used. So, if we do a little at each step, you get stuck in that micro-optimization cycle and you’re never actually going to understand what the impact of those optimizations was, or whether you spent too much time over-optimizing one part at the expense of another.

Jesse: It’s also really hard if this is a brand new workload that you’ve never run in the cloud before. You don’t necessarily know what the usage is going to be for this workload. Maybe you have an idea of usage patterns based on some modeling that you’ve done or based on other workloads that you’re running, but as a whole, if this is a brand new workload, you may be surprised when you deploy it and find out that it is using twice the amount of resources that you expected, or half the amount of resources that you expected, or that it is using resources and cycles that you didn’t expect.

Amy: Yeah. We’ve all been in that situation, at least if you work with consumer software: a bunch of users are going to do things that you don’t expect within your application, causing the traffic patterns that you predicted to move against the model. To put it kindly. [laugh].

Jesse: Yeah. So, generally speaking, what we’ve seen work the best is making time for cost optimization work, maybe a cycle every quarter, to do some analysis work: to look at your dashboards, look at whatever tooling you’re using and whatever metrics you’re collecting, and see what kind of cost optimization opportunities are available to you and to your teams.

Amy: So, that comes down to who’s actually doing this work. Are we going to assign a dedicated engineer to it in order to ensure it gets done, or anyone with the free cycles to do it?

Jesse: See, this is the one that I always love and hate because it’s that idea of: if it’s everyone’s responsibility, it’s no one’s responsibility. And I really want everybody to be part of the conversation when it comes to cost optimization and cloud cost management work, but in truth, that’s not the reality; that’s not the way to get this work started. Never depend on free cycles because if you’re just waiting for somebody to have a free cycle, they’re never going to do the work. They’re never going to prioritize cost optimization work until it becomes a big problem because that work is just going to be deprioritized constantly. There are a number of companies that I’ve worked for in the past that did hackathons, maybe once a quarter or once a year, and those hackathons were super, super fun for a lot of teams, but there were a couple of individuals who always picked up feature work as part of the hackathon, thinking, “Oh, well, I didn’t get a chance to work on this because my cycles were focused on something else, so now I’ll get a chance to do this.” No, that’s not what a hackathon is about.

Amy: You don’t hack on your own task list. That’s not how anything works.



Jesse: Exactly. So instead, rather than just relying on somebody to have a free cycle, putting it out there and waiting for somebody to pick up this work, there should be a senior engineer or architect with knowledge of how the system works who periodically dedicates a sprint to this analysis work. And when we say knowing how the system works, we’re really talking about that business context that we’ve talked about many, many times before. A lot of the cloud cost management tooling out there will make a ton of recommendations for you based on things like right-sizing opportunities and reservation investments, but those tools don’t have the business context that you and your teams do. So, those tools don’t know that the resources sitting idle in us-west-2 are actually your disaster recovery site, and you actually kind of need those—even though they’re not taking any work right now, you need them to keep your SLAs in check in case something goes down with your primary site.

Or maybe security expects resources to be set up in a certain way that requires higher latency based on end-to-end encryption. There are lots of pieces of business context that cloud cost management tools don’t have, and anybody doing cloud cost optimization work needs to gather that context by having conversations with other teams. Whoever does this cloud cost optimization work, or whoever makes the cloud cost optimization recommendations to other teams, needs to know the business context of those teams’ workloads so that the recommendations they make are actually actionable.

Corey: This episode is sponsored in part by our friends at Lumigo. If you’ve built anything from serverless, you know that if there’s one thing that can be said universally about these applications, it’s that it turns every outage into a murder mystery. Lumigo helps make sense of all of the various functions that wind up tying together to build applications. It offers one-click distributed tracing so you can effortlessly find and fix issues in your serverless and microservices environment. You’ve created more problems for yourself; make one of them go away. To learn more visit lumigo.io.



Amy: And they should also have the authority to do this work. It’s easy to deliver a team a list of suggestions saying, “Oh, I’ve noticed our utilization is really low on this one instance. We should possibly move it,” or what have you. And because they’re not the ones making the full architectural decisions, or leading that team, or in charge of that inventory, they don’t actually have the authority to tell anyone to do anything. So, whoever gets tasked with this really needs to be an architect on that team—if you’re going to go with this embedded resource type of person—where they have that authority to make that decision, act on it, and move things.

Jesse: Yeah. It’s really important that teams stay accountable for the resources that they’re running. And some teams don’t know any of the resources that they’re running; they kind of deploy into the cloud as a black box. And that is a perfectly fine business model for some organizations, but then they also need to understand that if the senior engineer or architect who is focused on cloud cost optimization work for this group says, “Hey, we need to tweak some of these workloads or configurations to better optimize them,” the teams need to be willing to have that conversation and be a part of it. So, we’ve talked about a couple of different ideas of who this person might be that does this work. This could be a DevOps team that attaches a dedicated resource to doing this analysis work, to making these recommendations, and then delegates the cost optimization work to the engineering teams, or it could be a dedicated cloud economist or cloud economist team who does this work.



Amy: We did touch on having someone in DevOps do this, just because they have a very broad view and the authority to issue tasks to engineering teams. If they see an application or an architecture where resources are hitting their utilization cap, or realize there are applications that need more or fewer resources, they’re able to do those types of investigations. Maybe someone on that team can take up this work and bring a more infrastructure-minded view to the entire account, see what’s going on in the account, and make those suggestions that way.

Jesse: Absolutely. It’s so important. Or if there is a dedicated cloud economist or maybe a cloud economist team that is able to make these recommendations, that has the authority to make these recommendations, maybe that’s the direction your group should go.

Amy: If only we spent an entire podcast talking about this.

Jesse: [laugh]. Huh, if only we spent an entire podcast talking about how to build a cloud cost team and how to get started as a cloud economist. Hmm…

Amy: Please check out the cloud economist starter kit that we’ve already published.



Jesse: Yes, several weeks ago. We’ll post the episode link in the [show notes 00:12:38] again. So, Amy, we’ve talked about when to do this work, who should do this work. What I want to know is how do these teams come together to have these conversations together? I’m thinking about best practices here. I’m thinking about how do teams start building best practices around this work so that each team isn’t working in a silo doing their own cost optimization work?



Amy: If you’re lucky, someone in your company has already done this work. [laugh]. And you can just steal their work.

Jesse: Absolutely.

Amy: Or borrow. Or collaborate. Whatever word you want to use.

Jesse: [laugh].

Amy: See if you can see how the project went, how they structured it. Maybe they ran into a process issue like they weren’t able to get the kind of access they needed without jumping through a whole bunch of red tape and hoops. That’s a good thing to know going into one of these projects, just being able to see the resources that you’re going to be looking at, and making sure you have access to them.

Jesse: Absolutely. This is part of why we also harp so much on open and clear communication across teams about the cloud cost management work that you’re doing. If you are trying to solve a problem, it’s likely that another team in the organization is also trying to solve that same problem, or ideally has already solved that problem, and then they can help you solve it. They can explain to you how they solved the problem so that you can solve it faster and don’t have to waste engineering cycles trying to reinvent the wheel, essentially. It’s a really, really great opportunity to build these best practices, to have these conversations together, maybe to build communities of practice within the organization, depending on how large your organization is, around the best ways to use these different tools and resources within the organization.

Jesse: Well, that will do it for us this week. If you’ve got questions that you would like us to answer on an upcoming episode, go to lastweekinaws.com/QA. If you’ve enjoyed this podcast, please go to lastweekinaws.com/review and give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review, give it a five-star rating on your podcast platform of choice and tell us how you integrate cost as a component of your engineering work.

Announcer: This has been a HumblePod production. Stay humble.

Fri, 04 Jun 2021 03:00:00 -0700
Turn That S--- Off

Want to give your ears a break and read this as an article? You’re looking for this link. https://www.lastweekinaws.com/blog/turn-that-sh—-off



Never miss an episode



Help the show



What's Corey up to?

Wed, 02 Jun 2021 03:00:00 -0700
AWS Compute Optimizer Now Less Crap
AWS Morning Brief for the week of May 31, 2021, with Corey Quinn.
Mon, 31 May 2021 03:00:00 -0700
Personality Merge Conflicts

Links:

Transcript

Corey: This episode is sponsored in part by LaunchDarkly. Take a look at what it takes to get your code into production. I’m going to just guess that it’s awful because it’s always awful. No one loves their deployment process. What if launching new features didn’t require you to do a full-on code and possibly infrastructure deploy? What if you could test on a small subset of users and then roll it back immediately if results aren’t what you expect? LaunchDarkly does exactly this. To learn more, visit launchdarkly.com and tell them Corey sent you, and watch for the wince.

Jesse: Hello, and welcome to AWS Morning Brief: Fridays From the Field. I’m Jesse DeRose.

Amy: I’m Amy Negrette.

Jesse: This is the podcast within a podcast where we talk about all the ways we’ve seen AWS used and abused in the wild, with a healthy dose of complaining about AWS for good measure. Today, we’re going to talk about the people side of technical projects, especially people who might introduce roadblocks for completing technical projects. Now, you might be thinking to yourself, “Jesse, Amy, that sounds like it is not about AWS.” But let me assure you that any project involving AWS is going to involve multiple different personalities approaching the project from different angles, people who ultimately all have the same goal in mind but have different ideas about how to get there, or different ideas about what the right thing to do is in the first place. So today, we want to talk about that: how can you have really rewarding conversations with those folks? How can you better engage people who are intentionally or unintentionally difficult?

Amy: We want to be very clear that we are not trying to come after anyone. Every time I’ve run into friction on an engagement, it isn’t because someone means to be difficult; maybe there’s a project timeline, maybe there’s something else getting in the way of them being fully present for that specific part of the engagement, and maybe that’s just what’s causing friction, causing speed bumps. And we’re all well aware of this. Jobs are hard, and this sort of work especially can be difficult. So, first of all, we totally understand, and this is just more about how to get everyone moving in the same direction at the same pace.

Jesse: Yeah, absolutely. I mean, especially with the pandemic going on right now, everybody’s doing remote work, some people have never actually met their teammates in person and they’re expected to work together efficiently, and quickly, and easily. It’s hard.

Amy: It also doesn’t help that when we do come in, we come in under the context of a cost optimization project, or some other efficiency-type title. And that sounds a little like the Bobs from Office Space, which I bring up a lot, especially during internal meetings. And it makes it sound like we’re going to come in, shake a bunch of things up, and look for inefficiencies where there aren’t any, which is truly not the case. And it can cause a lot of insecurity, especially about how someone thinks they’re doing their job, or whether their job is somehow going to be impacted by our suggestions. It may not just be us; it may be another migration consulting team, where someone’s coming in to change the architecture that they’ve worked on for a long time, and that can put a lot of people in a state of unease.

Jesse: And I think it’s also important to note that it’s not just about an external party coming in like Duckbill Group or another external, third-party consulting service, or technical group. It could be an internal separate team. It could be your internal cloud cost management team that is starting conversations with development teams saying, “Hey, I want to better understand how you’re using AWS. I want to understand some of these cost optimization opportunities.” Even in situations like that where all of these conversations are internal within the company, even within teams, there are still multiple different personalities, multiple different people approaching the problem from different angles, and it’s still really, really important to make sure that you approach them collaboratively.

Amy: And ultimately, we wanted to be clear that what we’re going to be talking about is helping people shift into a growth mindset, and being able to do this work without anyone feeling shame or embarrassment.

Jesse: Yeah. Growth mindset is so critical. It’s something that I love to talk about ad nauseum, and so I won’t dive into it too deeply here, but—

Amy: That’s another episode.

Jesse: [laugh]. Exactly. Growth mindset is so important for folks on technology teams, especially in today’s technology era where there’s just so much constant innovation. There’s so much new going on around you: new technologies, new teams, new ideas, new ways of doing things, new processes, new tools. It’s really important to be open-minded about learning those different things. You don’t have to use every single one of them, but be open-minded about different people approaching problems from different perspectives and different angles.

Amy: Having to face all of this uncertainty will cause some people to not be the most cooperative when they have to start reacting to these situations, whether it is an internal change that’s happening or an external consulting group; they can start pushing back and taking a defensive stance. And just like being back in middle school, sometimes standing up to a bully is simply how you have to get through it; it’s not about dominating, it’s about compromise, figuring out what each of you is trying to do, and finding that common ground.

Jesse: Yeah, absolutely. I think that’s the most important part here because when we talk about working with other personalities that are different than yours and having conflict, it’s not about dealing with them; it’s not about overcoming them from the perspective of winning the argument, so to speak. It’s about how do you compromise? How do you effectively find that common ground and move forward together? And sometimes it’s just about sharing context, it’s just about sharing that mental model that you have that might be different than the mental model that this other person has, or maybe the other team has.

Like for example, some teams that we’ve talked to can’t make a cost optimization change due to security, or legal, or product SLA restrictions, but maybe the person who’s coming in from the cloud cost management team or cloud cost management side doesn’t know that because they aren’t as familiar with the product.

Amy: But it can also just be a staffing issue. These projects take work, and if an engineering team is already stressed and stretched to the edge, they’re not going to have the resources, and they don’t want to be the ones to say we simply don’t have the manpower to do this.



Jesse: Yeah, absolutely. And it’s so, so important to be able to identify those bottlenecks or identify those constraints. Because ultimately, if you give a team that already has a ton of things on their roadmap new work and say, “Hey, cost optimization is important,” they’re most likely going to ask you, “Okay, where does this fit into all the other things that are already on our roadmap?” We’ve talked to a lot of companies who struggle with that balance of prioritizing new feature development with cost optimization work.

And there definitely needs to be that healthy balance between the two, because new feature work is obviously important for the business to grow, but cost optimization work is also important within these processes: as teams go through more and more agile sprints and build more and more things, each team needs to understand what its opportunities are to really optimize its spend as it builds new architecture.

Amy: We’ve all seen that bug board where everything is a P1 bug.



Jesse: Yes. The thing that’s coming to mind for me is: if it’s everyone’s responsibility, it’s nobody’s responsibility. And in the same way, if everything is a priority one problem that needs to be fixed, then essentially nothing is a priority one problem that needs to be fixed; nothing is going to get done because the teams are going to get completely burned out context-switching constantly between one priority and the next, rather than being able to actually focus on each piece of work as it comes up.

Corey: This episode is sponsored in part by our friends at Lumigo. If you’ve built anything from serverless, you know that if there’s one thing that can be said universally about these applications, it’s that it turns every outage into a murder mystery. Lumigo helps make sense of all of the various functions that wind up tying together to build applications.

It offers one-click distributed tracing so you can effortlessly find and fix issues in your serverless and microservices environment. You’ve created more problems for yourself; make one of them go away. To learn more visit lumigo.io.

Jesse: So, I was thinking about some of the best ways that we’ve seen this kind of work handled within organizations, some of the best ways I’ve seen this handled at previous companies that I’ve worked at, some of the best ways that we’ve handled this with some of the clients we’ve worked with. I actually ended up listening to a really great podcast episode about the science of productive conflict, and I’ll link that in the [show notes 00:10:18]. It basically broke down three types of conflicts; you have task conflicts, relationship conflicts, and status conflicts.

Task conflicts are your disagreements about a problem, a solution, or decision. Maybe I think that the best way to implement this new feature is in Python. You may think the best way to implement this new feature is in Golang, or Ruby, or something else. That kind of disagreement is a task conflict.

The next one is relationship conflicts where you’ve talked about differences in personalities or values. So, maybe I come from a world where product is the most important thing in everything that I do, so I always want to make sure that we focus on feature things first; I always want to make sure that I am doing what is asked of me. I always want to make sure that I am a yes man, so to speak. Whereas maybe you come from an environment or a space where you are more open to pushing back, having collaborative conversations. And it’s just different mental models of how we have been raised in the world, how we view the world, and it’s really, really difficult for us to get out of those different models, so it’s a harder type of conflict to have, relationship conflicts.

And then the third one is status conflicts where we disagree about where we fit into this hierarchy that we’re in together. Basically, who’s in charge? Who gets to decide what gets done?

Task conflicts and status conflicts can be productive; relationship conflicts are generally not, because, like I mentioned, relationship conflicts really come from people with different mental models, different views of the world, and it’s unlikely that you’re going to change someone’s fundamental values. But with that said, having a conversation with the other person, establishing psychological safety, giving them that space to say, “Hey, I want to know more about how you view this problem, so that I can also share how I view this problem, and we can better understand each other,” that’s going to make a world of difference in helping both sides better understand each other and find the right solution. It creates better opportunities for, say, a cost optimization team to learn that there are restrictions that mean they can’t apply all the savings opportunities they want to, and that’s fine; that’s a decision the business and the organization needs to make. But it helps both teams understand each other so they can have more constructive conversations about where the opportunities are, which areas can take cost optimization improvements, and how they can have good technical conversations together.

Well, if you’ve got questions you’d like us to answer go to lastweekinaws.com/QA. If you’ve enjoyed this podcast, please go to lastweekinaws.com/review and give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review, give it a five-star rating on your podcast platform of choice and tell us how you try to form those more collaborative, psychologically safe conversations when you run into conflicts in your organization. Thanks again.

Announcer: This has been a HumblePod production. Stay humble.

Fri, 28 May 2021 03:00:00 -0700
The 17 Ways to Run Containers on AWS

Want to give your ears a break and read this as an article? You’re looking for this link. https://www.lastweekinaws.com/blog/the-17-ways-to-run-containers-on-aws



Never miss an episode



Help the show



What's Corey up to?

Wed, 26 May 2021 03:00:00 -0700
Tim Banks Has Entered the Chat
AWS Morning Brief for the week of May 24, 2021 with Corey Quinn.
Mon, 24 May 2021 03:00:00 -0700
Build vs. Buy

Transcript


Corey: This episode is sponsored in part by LaunchDarkly. Take a look at what it takes to get your code into production. I’m going to just guess that it’s awful because it’s always awful. No one loves their deployment process. What if launching new features didn’t require you to do a full-on code and possibly infrastructure deploy? What if you could test on a small subset of users and then roll it back immediately if results aren’t what you expect? LaunchDarkly does exactly this. To learn more, visit launchdarkly.com and tell them Corey sent you, and watch for the wince.


Jesse: Hello, and welcome to AWS Morning Brief: Fridays From the Field. I’m Jesse DeRose.

Amy: I’m Amy Negrette.

Jesse: This is the podcast within a podcast where we talk about all the ways we’ve seen AWS used and abused in the wild. With a healthy dose of complaining about AWS for good measure. Today, we’re going to be talking about build versus buy. I feel like this is really kind of a classic engineering conversation. Amy, what is the build versus buy idea?

Amy: It’s really the idea of whether you decide to use a managed service or SaaS product versus rolling your own and building it yourself. It’s very easy to do these days after watching a few YouTube videos, maybe reading some blog articles. That’s also how you can do repairs on your house, which is why I always have to get repairs done on my house. [laugh].

Jesse: [laugh]. Yeah, I feel like as much as I love the world of HGTV and the DIY Network, I think I can do more than I actually can, and I feel like it’s probably a lot safer to just let a professional take the reins. I mean, there are so many certification programs that teach you how to build and manage your own engineering things: your own distributed databases, your own Kubernetes clusters, your own streaming data platform. It’s really great to understand the fundamental building blocks of these systems, to understand how they work, so that ultimately, if you are consuming from them or managing them, you understand the ins and outs of the system. But the question becomes: do you really need to be the one managing that system? Do you really need to be the one spending your time managing that system on top of writing code for your microservices, on top of managing the architecture, the application, all of the components of your service that are critical?

Amy: So, I guess what we really want to decide is, in what use cases is it okay to build something from scratch, and when is it okay to, essentially, just go to the market and look for something that’s made already?

Jesse: Yeah. And I think that’s the main question that a lot of folks ask: what is the defining line? What are the questions they should think about as they are choosing to build versus buy?

Amy: I think if you really want to look at building a product from the ground up—you have this product in mind and you want to do all the architecture, control it end-to-end—unless this is your core product feature, or you’re going to package it for either internal or public release, you almost always don’t want to build this yourself, because someone has probably built it already in a way that’s not going to cost your engineers time or money. Unless it is going to directly make you money, then yes: if this is tied to your product income and your product revenue, please build it yourself. It avoids a lot of licensing issues, and you get to control how it works, how you want it to work. But that said, a lot of products are just a bunch of assassins in a trench coat anyway, so—

Jesse: [laugh].

Amy: —it really depends on what’s important to you.

Jesse: Yeah, I feel like this is one of the biggest pitfalls that I see in a lot of organizations: they think about how they want to build out an architecture and decide that a solution like a stateful distributed service is the right thing they want. And one of the developers says, “Oh, that’s easy. I can build that in a weekend.” And then they go off and build it, and then they’re stuck managing that system for all of eternity, when that’s not the primary purpose of the team they’re working on, not the primary purpose of the product they’re working on. So, if you’re going to build something that is directly related to your product, directly related to your business use case, directly related to how your company is making money, something that is absolutely your bread and butter, you definitely want to build that rather than buying it off the shelf.

Because building it will give you that great opportunity to focus on controlling all the ins and outs of the system, understanding all the parts of the system, finding the flexibility when you need flexibility, really fine-tuning and honing all the parts of the system in the way that you need it to work, so that ultimately your organization is getting the best bang for its buck out of the system. Whereas in a lot of cases, you’re not going to get the same level of flexibility from an off-the-shelf solution.

Amy: And especially if you’re going in planning to build your own supporting product, make sure—and I’ve said this before, I’ll say it again—you check the licenses of any libraries and any SaaS products you use to build it, because I’ve reinvented the wheel plenty of times in my career specifically because I worked in a place where the licensing we were allowed to use would not allow us to use very specific products.

Jesse: Yeah. That’s such a critical business risk and something that I think not every engineer is fully aware of. And to be clear, I don’t think that’s the engineer’s fault. I think that’s part of best practices that every organization can get better at to make sure that everybody understands, what are our limitations on using modules, using open-source solutions from the internet? How can we make sure that we ultimately aren’t creating additional unnecessary business risk?

Amy: When do we go shopping?

Jesse: [laugh]. Yeah, let’s go shopping. Let’s say you’ve decided that the piece of software that you want is not part of your bread and butter, like we were saying. If it’s not part of your organization’s primary product, primary use case, don’t waste engineering time building it for yourself, pay a vendor or a subject matter expert to build it for you—or to manage it for you, even—and then call it a day. It is absolutely worth those trade-offs. The additional cost of paying somebody else to manage it for you is absolutely worthwhile because you then get the opportunity to stay focused on the things that are most important to your team and your business.

Corey: If your mean time to WTF for a security alert is more than a minute, it’s time to look at Lacework. Lacework will help you get your security act together for everything from compliance service configurations to container app relationships, all without the need for PhDs in AWS to write the rules. If you’re building a secure business on AWS with compliance requirements, you don’t really have time to choose between antivirus or firewall companies to help you secure your stack. That’s why Lacework is built from the ground up for the cloud: low effort, high visibility, and detection. To learn more, visit lacework.com.

Amy: You also don’t end up trapped by having to make sure the product is appropriately upgraded or patched. And then you also have that nice little bit of liability cover, saying, “We just bought this off the shelf. They said it was safe, and we trusted them.” [laugh].

Jesse: Yeah, again, business risk conversations: there is absolutely that opportunity for third-party liability rather than internal liability for some of those security risks. I also feel like it’s important to add that AWS, for example, has tons of managed services that give you ease of use by removing that administrative overhead. Yes, we’re primarily focused on AWS (obviously, this is an AWS-focused podcast), and there is definitely going to be a best and worst use case for these products, so I’m not saying you should go out and start using them all immediately without thinking about your overall goal and use case. But in a lot of cases, again, if the solution that you want is not something that you need to manage yourself, that you need to focus on building and running yourself, give it to AWS; they have tons of these managed solutions available to you, built into the ecosystem.

Amy: And that’s true of all the large cloud providers. They have managed services to take the things that you do not have the staffing to be an expert in and do all that work for you. And it’s not as if you are locked into these solutions. When you buy into either a SaaS product or a managed service, you can migrate off if you feel like you can build it better, once you’ve actually spent the time in R&D, built out a minimum viable product, and know that the use case works for you. If you can then either clear out overhead or fees and actually come in under what you’re spending right now, make that move. But do it after you already know what it is you want.

Jesse: Yeah, I think that’s a really great use case example, Amy. One other thing that I want to talk about is that this build versus buy conversation has so far been focused on your organization thinking about whether it wants to build something internally or buy it from a third-party vendor. But this conversation can also happen internally, in a single organization, between teams. I’m thinking about some organizations that I’ve worked for where I’ve seen one team build and manage a central platform solution, like a central CI/CD pipeline that every other team is going to be using and consuming from. But then, one team decides that the CI/CD platform that everybody’s using doesn’t really do all the things they want it to do, so they decide to go off and build their own CI/CD platform internally for their team instead, rather than working with the team that actually owns the centralized CI/CD platform to make sure that everybody gets the benefit of the additional features, additional solutions, and bug fixes that the team was asking for.

Amy: It’s really hard when you can’t see the forest for all the silos.

Jesse: Yes. Absolutely. It is so, so critical to think about building these feedback loops into your internal tools. Because if your customers are internal to your organization, they’re going to want to provide that feedback in some capacity to help you understand when the service that you’re building is fantastic and when the service that you’re building is awful. And it’s so, so critical to make sure that you have those easy feedback loops so that you can continue to iterate on the things that you choose to build internally and hone them and make them better.

If you’ve got questions that you’d like us to answer go to lastweekinaws.com/QA. If you’ve enjoyed this podcast, please go to lastweekinaws.com/review and give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review, give it a five-star rating on your podcast platform of choice and tell us the criteria you think about when considering whether you should build or buy.

Announcer: This has been a HumblePod production. Stay humble.

Fri, 21 May 2021 03:00:00 -0700
New CEO Onboarding at AWS

Want to give your ears a break and read this as an article? You’re looking for this link. https://www.lastweekinaws.com/blog/New-CEO-Onboarding-at-AWS



Never miss an episode



Help the show



What's Corey up to?

Wed, 19 May 2021 03:00:00 -0700
Adam Selipsky's Day One Coreyentation
AWS Morning Brief for the week of May 17, 2021 with Corey Quinn.
Mon, 17 May 2021 03:00:00 -0700
Cloud Cost Management Starter Kit

Transcript
Corey: This episode is sponsored in part by LaunchDarkly. Take a look at what it takes to get your code into production. I’m going to just guess that it’s awful because it’s always awful. No one loves their deployment process. What if launching new features didn’t require you to do a full-on code and possibly infrastructure deploy? What if you could test on a small subset of users and then roll it back immediately if results aren’t what you expect? LaunchDarkly does exactly this. To learn more, visit launchdarkly.com and tell them Corey sent you, and watch for the wince.

Jesse: Welcome to AWS Morning Brief: Fridays From the Field. I’m Jesse DeRose.

Amy: I’m Amy Negrette.



Jesse: This is the podcast within a podcast where we talk about all the ways we’ve seen AWS used and abused in the wild, with a healthy dose of complaining about AWS for good measure because, I mean, who doesn’t love to complain about AWS? I feel like that’s always a good thing we can talk about, no matter the topic. Today, we’re going to be talking about the ‘cloud cost management starter kit.’ So, the starter kit seems to be a big fad that’s going around. If you’re listening to this episode, you’re probably thinking, “It’s already done. It’s over.”



But I still want to talk about it. I think that this is a really relevant topic because a lot of companies are trying to get started, get their hands dirty in cloud cost management. So, I think this would be a great thing for us to talk about: what’s in our cloud cost management starter kit?



Amy: And it really will help answer that question that I get asked a lot: what even is a cloud economist, and what do you do?



Jesse: Yeah, I mean, given the current timeframe, I haven’t gone to any parties recently to talk about what I do, but I do feel like anytime I try to explain to somebody what I do, there’s always that moment of, “Okay. Yes, I work with computers, and we’ll just leave it at that.”



Amy: It’s easier to just think about it as we look at receipts, and we kind of figure things out. But when you try to get into the nuts and bolts of it, it’s a very esoteric idea that we’re trying to explain. And no, I don’t know why this is a real job. And yet it is.



Jesse: This is one of the things that always fascinates me. I absolutely love the work that I do, and I definitely think that it is important work that any organization needs in order to work on its cloud cost management best practices, but it also boggles my mind that AWS, Azure, and GCP haven’t figured out how to bake this more clearly and easily into all of their workflows and all their services. It still boggles my mind that this is something that exists as—

Amy: As a thing we have to do.

Jesse: As a thing we have to do. Yeah, absolutely.

Amy: Well, the good news is, they’re going to change their practices once every six weeks, and we’ll have a whole new thing to figure out. [laugh].

Jesse: [laugh]. So, let’s get started with the first item on our cloud cost management starter kit. This one is something that Amy is definitely passionate about; I am definitely passionate about, as well. Amy, what is it?

Amy: Turn on your CUR. Turn on your CUR. If you don’t know what it is, just Google AWS CUR. Turn it on. It will save you a headache, and it will save anyone you bring in to help you [laugh] [unintelligible 00:02:59] a huge headache. And it keeps us from having to yell at people, even though that’s the thing that if you pay us to do it, we will totally do it for you.

Jesse: If you take nothing away from this episode, go check out the AWS Cost and Usage Report—otherwise known as CUR—turn it on for your accounts, ideally enable it in Parquet format because that’s going to allow you to get all that sweet, sweet data in an optimized manner, living in your S3 bucket. It is a godsend. It gives you all the data from Cost Explorer, and then some. It allows you to do all sorts of really interesting business intelligence analytics on your billing data. It’s absolutely fantastic.

Amy: It’s like getting all of those juicy infrastructure metrics, except with a dollar sign attached, so you know what you’re actually doing with that money.

Jesse: Yeah, this definitely is, like, the first step towards doing any kind of showback models, or chargeback models, or even unit economics to figuring out where your spend is going. The Cost and Usage Report is going to be a huge first step in that direction.

Amy: Now, the reason why we yell at people about this—or at least I do—is because AWS will only show you CUR data from the time that it is turned on. AWS does have your billing data for historical periods, but if you enable the CUR at a specific point, all of your reports are going to start there. So, if you’re looking to do forecasting, or you want to know what your usage is going to look like from this point on, turn it on as early as possible.

Jesse: Absolutely. If you are listening to this now and you don’t have the CUR enabled, definitely go pause this episode, enable it now, and come back and listen to the rest of the episode because the sooner you have the CUR enabled, the sooner you’ll be able to get those sweet, sweet metrics for all of your—

Amy: And it’s free.

Jesse: [laugh]. Yeah, that’s even the more important part. It’s free. There’s going to be a little bit of data storage cost if you send this data to S3, but overall, the amount of money that you spend on that storage is going to be optimized because you’re saving that CUR data in Parquet format. It’s absolutely worthwhile.
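For anyone who wants the concrete version of “turn it on”: below is a minimal sketch of defining a CUR in Parquet format with boto3. The report name, bucket, and prefix are placeholders, and the target bucket is assumed to already exist with a policy that lets the CUR service write to it.

```python
# Minimal sketch: define a Cost and Usage Report in Parquet format with boto3.
# The report name, bucket, and prefix below are placeholders; the S3 bucket
# must already exist with a policy granting write access to the CUR service.
import boto3

# The CUR API is only served out of us-east-1.
cur = boto3.client("cur", region_name="us-east-1")

cur.put_report_definition(
    ReportDefinition={
        "ReportName": "hourly-cur-parquet",        # placeholder report name
        "TimeUnit": "HOURLY",
        "Format": "Parquet",
        "Compression": "Parquet",                  # Parquet format requires this
        "AdditionalSchemaElements": ["RESOURCES"], # include per-resource line items
        "S3Bucket": "example-billing-bucket",      # placeholder bucket
        "S3Prefix": "cur",
        "S3Region": "us-east-1",
        "AdditionalArtifacts": ["ATHENA"],         # make the data queryable in Athena
        "RefreshClosedReports": True,              # pick up late-arriving charges
        "ReportVersioning": "OVERWRITE_REPORT",    # required with the Athena artifact
    }
)
```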

All right, so number two; the second item on our cloud cost management starter kit is getting to know your AWS account manager and account team. This is one where I feel like a lot of people don’t actually know that they have an AWS account manager. But let me tell you now: if you have an AWS account, you have an AWS account manager. Even if they haven’t reached out to you before, they do exist, you have access to them, and you should absolutely start building a rapport with them.

Amy: Anytime you are paying for a support plan, you also have an account manager. This isn’t just true for AWS; I would be very surprised to find any service that charged you for support but did not give you an account manager.

Jesse: So, for those of you who aren’t familiar with your account manager, they are generally somebody who will be able to help you navigate some of the more complex parts of AWS, especially when you have any kind of questions about your bill or about technical things using AWS. They will help you navigate those resources and make sure that your questions are getting to the teams that can actually answer them, and then make sure that those questions are actually getting answered. They are the best champion for you within AWS.

If you have more than a certain threshold of spend on AWS, if you’re paying for enterprise support, you likely also have a dedicated technical account manager as well, who will be basically your point person for any technical questions. They are a great resource for any technical questions, making sure that your technical questions are answered, making sure that any concerns that you have are addressed, and that they get to the right teams. They can give you some guidance on possibly how to set up new features, new architecture within AWS. They can give you some great, great guidance about the best ways to use AWS to accomplish whatever your use case is. So, in the cases where you’ve got a dedicated technical account manager as well, get to know them because again, they are going to be your champion. They are here to help you. Both your account manager and your technical account manager want to make sure that you are happy with AWS and continue to use AWS.

Amy: And the thing to know about the account manager is, like, if you ever run into that situation where, oh, something was left on erroneously and we ended up with a spike, or this is how I was understanding the service to work and it didn’t work that way, and now I have some weird spend, but I turned it off immediately, if you ever want to get a refund or a credit or anything, these are the people to talk to; they’re the ones who are going to help you out.

Jesse: Yeah, that’s a great point. It’s like, whenever you call into any kind of customer support center, if you treat the person who answers the phone with kindness, they are generally more likely to help you solve your problem, or generally more likely to go out of their way to help you solve your problem. Whereas if you just call in and yell at them, they have no interest in helping you. So—

Amy: You’ll never see that refund.

Jesse: Exactly. So, the more that you can create that rapport with your account manager—and your technical account manager if you have one—the better the chances that they will fight for you internally and go above and beyond to make sure that you can get a refund if you accidentally left something running, or that any billing issues are taken care of extremely fast, because they’ve already built that rapport with you. They care about you the same way you care about them, and they care about you continuing to use AWS.

Amy: There’s another note about the technical account managers: if you are very open with them on what your architecture plans are—“We’re going to move into this type of EKS deployment. This is the kind of traffic we think we’re going to run, and we think it’s going to be shaped this way”—they’ll help you out and build that in the most efficient way possible, because they also don’t want resources out there being either overutilized or just run poorly. They’ll help you figure out the best way of building that. Also, if AWS launches a new program and you spend a lot of money on AWS, maybe there’s a preview program that they think will help you solve a very edge-case kind of issue that you didn’t think you had before.

Jesse: Absolutely.

Amy: Yeah. So, it’s a great way to open these paths and build these relationships because it helps both parties out.

Corey: This episode is sponsored in part by VMware. Because let’s face it, the past year hasn’t been kind to our AWS bills or, honestly, any cloud bills. The pandemic had a bunch of impacts. It forced us to move workloads to the cloud sooner than we would have otherwise. We saw strange patterns, such as user traffic dropping off but infrastructure spend not following. What do you do about it? Well, the CloudLive 2021 Virtual Conference is your chance to connect with people wrestling with the same type of thing, be they practitioners, vendors in the space, or leaders of thought—ahem, ahem—and get some behind-the-scenes looks into the various ways different companies are handling this. Hosted by CloudHealth by VMware on May 20th, the CloudLive 2021 Conference will be 100% virtual and 100% free to attend. So, you really have no excuse for missing out on this opportunity to deal with people who care about cloud bills. Visit cloudlive.com/corey to learn more and save your virtual seat today. That’s cloud l-i-v-e dot com slash c-o-r-e-y. Drop the “e,” we’re all in trouble. My thanks to VMware for sponsoring this ridiculous episode.

Jesse: So, the third item on our cloud cost management starter kit is identifying all of your contracts. Now, I know you’re probably thinking, “Well, wait. I’ve just got my AWS bill; what else should I be thinking about?” There are other contracts that you might have with AWS. Now, you as the engineer may not know this, but there may be other agreements that your company has entered into with AWS: you might have an enterprise discount program agreement, you might have a private pricing addendum, you might have a migration acceleration program agreement. There are multiple different contracts that your company might have with AWS, and you definitely want to make sure that you know about all of them.

Amy: If you’re ever in charge of an architecture, you’re going to want to know not just what your costs are at the end of the day, but also what they are before all your discounts, because those discounts can camouflage heavy usage if that usage is being covered by refunds and credits.

Jesse: Absolutely, totally agreed. Yeah, it’s really, really important to understand, not just your net spend at the end of the day, but your actual usage spend. And that’s a big one that I think a lot of people don’t think about regularly and is definitely important to think about when you’re looking at cloud cost management best practices and understanding how much your architecture is actually costing you on a team-by-team or product-by-product basis.
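One rough way to see that gap, assuming your account actually has negotiated discounts in play, is to ask the Cost Explorer API for both UnblendedCost and NetUnblendedCost and compare them month over month. A minimal boto3 sketch, with illustrative dates:

```python
# Sketch: compare usage cost before and after discounts via Cost Explorer.
# Assumes credentials with ce:GetCostAndUsage; the date range is illustrative.
import boto3

ce = boto3.client("ce", region_name="us-east-1")

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2021-01-01", "End": "2021-04-01"},
    Granularity="MONTHLY",
    # UnblendedCost is the cost as billed; NetUnblendedCost is the same
    # usage after applicable discounts are netted out.
    Metrics=["UnblendedCost", "NetUnblendedCost"],
)

for period in resp["ResultsByTime"]:
    start = period["TimePeriod"]["Start"]
    gross = period["Total"]["UnblendedCost"]["Amount"]
    net = period["Total"]["NetUnblendedCost"]["Amount"]
    print(f"{start}: gross={gross} net={net}")
```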

Amy: Also, make sure if you’re doing reservations that you know when those reservations and savings plans end—

Jesse: Yes.

Amy: —because you don’t want to have to answer the question, “Why did all of your costs go up when you actually have made no changes in your infrastructure?”

Jesse: Yeah. Half the battle here is knowing that these contracts and reservations exist; the other half of the battle is knowing when they expire so that you can start having proactive conversations with teams about their usage patterns to make sure that they’re actually fully utilizing the reservations, and fully utilizing these discounts, and that they’re going to continue utilizing those discounts, continue utilizing those reservations so that you could ultimately end up purchasing the right reservations going forward, or ultimately end up renegotiating at the correct discount amount or commitment amount so that you are getting the best discount for how much money you’re actually spending.
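Here is a minimal sketch of how you might keep an eye on those end dates rather than being surprised by them, assuming read-only access to the EC2 and Savings Plans APIs (and remembering that EC2 reservations are per-region, so you would repeat this for each region you buy in):

```python
# Sketch: list active EC2 Reserved Instances and Savings Plans with end dates.
# Assumes ec2:DescribeReservedInstances and savingsplans:DescribeSavingsPlans
# permissions; us-east-1 is illustrative, and RIs must be checked per region.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
sp = boto3.client("savingsplans", region_name="us-east-1")

# Active EC2 Reserved Instances and when they lapse.
ris = ec2.describe_reserved_instances(
    Filters=[{"Name": "state", "Values": ["active"]}]
)
for ri in ris["ReservedInstances"]:
    print("RI", ri["ReservedInstancesId"], "ends", ri["End"])

# Active Savings Plans and when they lapse.
plans = sp.describe_savings_plans(states=["active"])
for plan in plans["savingsPlans"]:
    print("Savings Plan", plan["savingsPlanId"], "ends", plan["end"])
```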

So, the last item on our cloud cost management starter kit is thinking about the non-technical parts of projects. Amy, when you think about the non-technical parts of projects, what do you think about?

Amy: Non-technical always makes you think of people and process. So, this would be the leadership making the decisions on what those cost initiatives are. Maybe they want to push this down to the team lead level: it would include that. Or maybe they want to push it down to the engineering level, or the individual contributor level. There are some companies that are small enough that an engineer can be completely cognizant and responsible for the spend that they make.

Jesse: Yeah. I think that this is a really, really critical item to include in our starter kit because leadership needs to be bought into and back whatever work is being done, whatever cloud cost management work is being done. But also teams need to be empowered to make the changes that they want to make, make the changes that will ultimately provide those cloud cost management optimization opportunities and better cost visibility across teams. So, does everybody know what their teams are empowered to do, what their teams are capable of? Does everybody know what their teams are responsible for on the flip side? Do they ultimately know that they are responsible for managing their own spend, or do they think that the spend belongs to somebody else? Also, do they understand which resources are part of their budget or part of their spend?

Amy: It’s the idea that ownership of—whether it’s a bill, whether it’s a resource—comes down to communication, and level setting. Do we know who owns this? Do we know who’s paying for it? Do they know the information in the same way? Is there someone who’s outside who can figure out this information for themselves? Just making sure that it’s done in a clear enough way that everyone knows what’s going on.

Jesse: Absolutely. Well, that will do it for us this week. Those are our four main items for our cloud cost management starter kits. If you’ve got questions you’d like us to answer, please go to lastweekinaws.com/QA, fill out the fields and submit your questions.

If you’ve enjoyed this podcast, please go to lastweekinaws.com/review and give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review, give it a five-star rating on your podcast platform of choice and tell us, what would you put in your ideal starter kit?



Announcer: This has been a HumblePod production. Stay humble.

Fri, 14 May 2021 14:13:18 -0700
Security is Someone Else’s Job Zero

Want to give your ears a break and read this as an article? You’re looking for this link. https://www.lastweekinaws.com/blog/security-is-someone-elses-job-zero



Never miss an episode



Help the show



What's Corey up to?

Wed, 12 May 2021 03:00:00 -0700
AWS Morning Brief Trailer
The latest in AWS news, sprinkled with snark. Posts about AWS come out over sixty times a day. We filter through it all to find the hidden gems, the community contributions--the stuff worth hearing about! Then we summarize it with snark and share it with you--minus the nonsense.
Tue, 11 May 2021 14:47:48 -0700
Time to Fire the DevOps Guru
AWS Morning Brief for the week of May 10, 2021 with Corey Quinn.
Mon, 10 May 2021 03:00:00 -0700
A Very Special Episode

Transcript

Corey: This episode is sponsored in part by LaunchDarkly. Take a look at what it takes to get your code into production. I’m going to just guess that it’s awful because it’s always awful. No one loves their deployment process. What if launching new features didn’t require you to do a full-on code and possibly infrastructure deploy? What if you could test on a small subset of users and then roll it back immediately if results aren’t what you expect? LaunchDarkly does exactly this. To learn more, visit launchdarkly.com and tell them Corey sent you, and watch for the wince.


Jesse: Today, on a very special episode of AWS Morning Brief: Fridays From the Field, we say our goodbyes to Pete Cheslock.

Amy: Oh, no. Did the ops bus finally get him?

Jesse: No. Wait, what? What? No. No, he’s not—

Amy: You know, the ops bus, the one that takes out all of the ops people, which is why you need data recovery plans.

Jesse: [laugh]. I mean, I have plans for other reasons, but no. No, Pete, Pete’s not dead. He’s just—I mean, he’s dead to me, but he’s just not going to be here anymore.

Amy: Only on the inside.

Jesse: Welcome to AWS Morning Brief: Fridays From the Field. I’m Jesse DeRose.

Amy: I’m Amy Arumbulo Negrette.

Pete: I am Pete Cheslock. I’m here for one last, beautiful, glorious time.

Jesse: I feel like this is going to be like Breakfast Club but in the data center server room.

Pete: Yeah. A little bit. I think so. We will all sit cross-legged on the floor in a circle, share our thoughts and feelings. And maybe some sushi. There was sushi in that movie. And that was, like, really advanced back then in the ’80s.

Jesse: Yeah, I like that. So Pete, you want to give us a little bit of background about why you will be moving on from this podcast?

Pete: Moving on to a whole new world. Yes. Sadly, I am not dead. The ops bus did not get me, and I was not eaten by my smoker, my meat smoker.


Jesse: [laugh]. Although at this point, it’s probably overdue.

Pete: You know, the odds of all three of those are pretty high, to be perfectly honest, given this pandemic and everything else going on in this world.

Amy: Isn’t that how it works? You eventually become the smoked meat.

Pete: Yeah, yeah.

Jesse: [laugh].

Pete: All the time. You know, you are what you eat. And if you eat junk and whatnot—so I eat smoked meats; eventually, I’m just going to become, you know, smoked meats, I guess. But no, I am moving on from The Duckbill Group. Bittersweet is the best word I can come up with. Very sad, but also very excited.

I’m moving on to a new role at a new company that was just kind of an opportunity that I couldn’t pass up. And I’m really excited for something new, but really sad because I don’t get to work with two of my three favorite cloud economists, Jesse and Amy. Yeah, Corey is one, too, and yes, it’s fun to work with him. But it’s also fun to rag on him a little bit as well.

Jesse: I’m pretty sure you still have the opportunity to rag on him no matter where you go.

Pete: Yeah, that’s true. I mean, we’re Twitter connected. So, I can just slide into his DMs as needed. Yeah.

Amy: And really, what else is Twitter for—

Pete: Exactly.

Jesse: [laugh].

Amy: —than roasting former coworkers and bosses?

Pete: Yeah, I expect a constant stream of Twitter DMs every time you find something, some little fun nugget that I’ve left behind.

Jesse: I feel like that’s appropriate. So today, Pete, now that you’re moving on from Duckbill Group and from this podcast, I have two questions for you. Looking back at your time here working with Duckbill Group, what did you learn? What are the things that surprised you, that you didn’t expect? And what would you say to somebody who wanted to start working in this space, maybe start a career in cloud economics on their own?

Pete: Yeah, so this kind of feels like an exit interview a little bit.

Jesse: [laugh]. And a very public exit interview at that. So, make sure that we bleep all the swear words.

Pete: I think it’s in Duckbill fashion to do a public—a very public-facing exit interview, right? That is Duckbill in a nutshell.

Jesse: I think the only thing more public is if Corey asks you to hold the exit interview on Twitter.

Amy: Exactly.

Pete: [laugh]. I mean, we might have to do that, now. I like that idea. Yeah, so I think those are great questions, and I love the opportunity to talk about it. Because Duckbill is a fantastic company, and coming into Duckbill last year was totally by luck.

Not really—no, not—luck is maybe not the right word. But I had been doing some consulting on my own, and the pandemic and some other forces caused a bunch of my consulting work to dry up really quickly. And I was sitting at home and I’m like, “Wow, I should get a real job.” And I saw a tweet from Mike on Twitter that was like, “Oh, we’re growing The Duckbill Group.” And Mike and Corey and I have known each other for such a long time.

We’ve always said it’d be great to work together at some point in the future, but it’s so hard [laugh] to do. You know, to kind of work with your friends, and timing, and circumstance, and schedule, and everything else. And so when I saw that, I was like, wow, like that might be a lot of fun working with that crew. And I’ve got a lot of experience in AWS and I’ve—my title at one of my previous companies was Captain COGS—for Cost Of Goods Sold—because I was so diligent with the Amazon bill. So, it’s kind of one of those things where I felt like I could be useful and helpful to the organization, and talking with Mike and Corey, it just made a ton of sense.


And so, it was a lot of fun to come on board. But then once you’re kind of in, and you start doing this type of work—and you know, Amy and Jesse, you’ve both experienced this—I think no matter how much knowledge you have of Amazon, very, very quickly you realize that you actually don’t know as much as you really think you did, right?


Jesse: Yeah.

Pete: Because it’s so—there’s just so much.


Amy: And it changes once every five minutes.


Pete: [laugh].


Jesse: Oh, yeah.


Amy: Literally if you—well, just keep an eye on that changelog, you can watch your day get ruined as time goes on.


Jesse: [laugh].


Pete: [laugh]. It’s—yeah, it’s real-time day ruining. It’s like Amazon Kinesis: It’s all real-time.

Jesse: [laugh].


Pete: Yeah, it’s so true. And I think the reason behind it is, you know, one of the first things I kind of realized is that when you are working inside of a business and you’re trying to understand, like, an Amazon service, you don’t necessarily go that deep because you’ve got a real job and other stuff to do. And when you’re finally, like—let’s say you’re in Cost Explorer; this is actually my favorite one because learning this took us a while. The docs aren’t very good. But in Cost Explorer, there’s a dropdown box that can show you your charges in different ways: unblended view, blended view, amortized view—if I’m even saying that word correctly—net-amortized view, net-unblended view. Like, what do all these mean?


Most people just pick unblended and move on with their lives. But at some point, you kind of need to know and answer that question, and then understand the impact, and all those things, and spend more hours than I care to count trying to get the bill and Cost Explorer to line up. Something that simple; why is that so hard? You know, it’s things like that.
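[Editor’s note: for the curious, here’s a minimal sketch of what comparing those views looks like against the Cost Explorer API, using Python and boto3. The dates are placeholders, and it assumes Cost Explorer API access is enabled in the account.]

```python
import boto3

ce = boto3.client("ce")  # the Cost Explorer API

# The same month, reported under three of the metrics from that dropdown.
# The totals rarely line up, which is rather the point.
for metric in ["UnblendedCost", "AmortizedCost", "NetAmortizedCost"]:
    resp = ce.get_cost_and_usage(
        TimePeriod={"Start": "2021-04-01", "End": "2021-05-01"},
        Granularity="MONTHLY",
        Metrics=[metric],
    )
    total = resp["ResultsByTime"][0]["Total"][metric]
    print(metric, total["Amount"], total["Unit"])
```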

Amy: Why is that so hard? I do not understand it. It is exhausting. [laugh].

Jesse: It drives me absolutely crazy, and it’s something that in previous roles, you could just say, “Well, this isn’t my responsibility, so I’m not going to worry about it.” But now we’ve got clients who are asking us these questions because it is our responsibility and we do need to worry about it.


Pete: Yeah, exactly. So, I think that’s just, kind of, one example. Now, there was a ton that I learned. I mean, just in how discounts might be applied when you look at charges in an account, whether you have an enterprise discount program or private pricing in some way. I think one of my favorite ones—and this is actually something that trips a lot of people up—is that in Cost Explorer, there are kind of two ways that you can view a charge.

So, let’s say you’re looking at S3, and you are trying to find your usage by the usage type. Like, I want to compare standard storage to maybe data transfer or something like that. And you go and group by usage type, and they’ll show you, “Hey, for your S3 for this month or day or whatever, you’ll have some spend associated storage and data transfer,” and you’re like, “That’s neat.” And then you say to yourself, “Now, I want to look at it by API.” And maybe you’ll see, wow, there’s a ton of spend associated with GETs or PUTs.

And you’ll think that that is actually a request charge. And it’s totally not. It’s like, when you group by API, it’s the API that started the charge, not the charge itself. So, you could have a PUT that started the charge, but the charge itself is actually storage. It’s the little things like that, where you might glance at it in your account and go, “Oh, okay.” But then when you actually need to get down to the per penny on spend and share it with a client, you go even further down the rabbit hole.
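[Editor’s note: here’s a rough sketch of that comparison via the Cost Explorer API with boto3. Grouping by USAGE_TYPE shows what the charge actually is; grouping by OPERATION shows which API call initiated it. Dates are placeholders.]

```python
import boto3

ce = boto3.client("ce")

# S3 spend grouped two ways. The OPERATION view is the one that fools
# people: a PutObject can "own" spend that is really storage.
for key in ["USAGE_TYPE", "OPERATION"]:
    resp = ce.get_cost_and_usage(
        TimePeriod={"Start": "2021-04-01", "End": "2021-05-01"},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        Filter={"Dimensions": {"Key": "SERVICE",
                               "Values": ["Amazon Simple Storage Service"]}},
        GroupBy=[{"Type": "DIMENSION", "Key": key}],
    )
    print(f"--- grouped by {key} ---")
    for group in resp["ResultsByTime"][0]["Groups"]:
        print(group["Keys"][0], group["Metrics"]["UnblendedCost"]["Amount"])
```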

Jesse: Because why would all of the billing information across different sources be accurate?

Amy: And also, why would things be named the same between the bill, and Cost Explorer, and the CUR? Having those names be the same, that would just make it too easy, and just streamline the process too much, and be too logical. No, let’s work for it. We have to work for it. It’s a pillar of excellence; we have to work for it.


Pete: [laugh]. Exactly. So yeah, I think it’s those types of things that you just start seeing the edge cases. But because of, kind of, the work we do, we keep going. We’re not just, “Oh, wow. Haha, silly Amazon.”


But then we keep diving in deeper and deeper to figure out the why. And the reason for that really just comes down to the fact that we’ll need to communicate it in some effective way to the client to get them to understand it. And actually, that kind of leads me to the other thing, which I think is probably the most important skill of being a cloud economist, of being in FinOps: long-form writing. Being able to write clear, concise prose explaining why the spend is what it is, explaining all of these edge cases, all these interesting parts of cloud cost management, and writing it down in such a way that anyone could read it: a CFO could understand how the charges are happening, just like a head of engineering, who maybe has more impact on the spend.


Jesse: Being able to communicate the differences between different AWS services and different billing modes to different audiences is so critical to the work that we do, because we’re ultimately going to be working with people from different backgrounds at every single client. So, we need to be able to speak the language of different audiences.

Amy: And it really varies: different C-suites, different departments, their goals are going to be different, too, because they have requirements that they have to fulfill. Finance is very concerned about the literal cost of things, while engineering understands that their architecture comes at a price, and so long as they have the budget for it, they’re cool with it. And you just have to align what those goals are, and have that translate into the document as, “They built it this way for this reason, which was fine at that stage. But as you grow, you need to make sure that it also fulfills these other external expectations.”


Corey: Let’s be honest—the past year has been a nightmare for cloud financial management. The pandemic forced us to move workloads to the cloud sooner than anticipated, and we all know what that means—surprises on the cloud bill and headaches for anyone trying to figure out what caused them. The CloudLIVE 2021 virtual conference is your chance to connect with FinOps and cloud financial management practitioners and get a behind-the-scenes look into proven strategies that have helped organizations like yours adapt to the realities of the past year. Hosted by CloudHealth by VMware on May 20th, the CloudLIVE 2021 conference will be 100% virtual and 100% free to attend, so you have no excuses for missing out on this opportunity to connect with the cloud management community. Visit cloudlive.com/corey to learn more and save your virtual seat today. That’s cloud-l-i-v-e.com/corey to register.


Pete: Yeah, that’s exactly right. I mean, it’s just—and can you imagine, you have some knowledge you want to share around something as complex as the Amazon bill. I mean we ask for a PDF of your bill when you start working with Duckbill Group. That could be hundreds of pages long, and you’re trying to distill that down into something that, really, anyone can understand. It’s a true superpower to be able to write long-form content like that really well.


And I never used to like writing. I never really enjoyed it that much, but over the last year, working out that muscle, the ability to write many, many pages around this type of content just comes so much more easily now. So, I think that’s another big aspect, right? The more you work on it, obviously the easier it gets.


Jesse: I don’t know about you, but now that I have focused more on flexing that writing and communication muscle, I’ve noticed it more, both in everyone I work with day-to-day at Duckbill Group and in my daily life: just watching how people communicate with each other, and how effectively they communicate with each other. It’s both amazing and nerve-wracking at the same time.

Pete: [laugh]. I know. And even we, whenever we sit down to write the reports that we give to our clients, go through the wave of emotions, the back and forth of, like, “I don’t know what to write,” and then, “Oh, I know of a lot of stuff to write. Let me just get something down.” And then you can’t stop writing. It’s this emotional roller coaster that, no matter how many times we need to write a lot of detailed information down, everyone always goes through.

Amy: And we really do have a highly collaborative process here, too, where we’re all in the same document, writing, and the person who owns any given report will always have the same stage at the end when all of the sections are filled out, where they go to one of the other people on the team and go, “Every word I put down is absolute garbage. Please help me trim it down, take it out. I don’t even care anymore. Just look at it and tell me that I wrote down words that are in some kind of human language.” [laugh].


Jesse: [laugh].

Pete: [laugh]. Oh, the plight of the writer. It’s, like, the imposter syndrome that affects the writer. It’s like, “Okay. I wrote a bunch of stuff. I think it’s terrible.” And then you sleep on it, you come back the next day, and you’re like, “Actually, this is pretty good.” [laugh].

Amy: I explained concepts. It was fine. I didn’t use a single comma for three pages, but it’s probably fine. [laugh].

Jesse: [laugh].

Pete: You can take one of mine. Usually, all of my draft documents are commas and em-dashes, just all over the place. Yeah, so I think that’s honestly a big superpower. And the last two things—and this is actually something that I’ve looked for in people that I’ve wanted to work with, and people I was hiring, and I see it here as well—are these two concepts of intellectual curiosity and aptitude to learn. If you have a base knowledge of Amazon and you have those other attributes, that curiosity and truly enjoying learning, you can accelerate your ability to understand this incredibly quickly, because there’s such a wealth of information out there, and there are so many documents, so much stuff. It just requires someone who cares enough to dive in and really want to understand.

That’s something that I think we’ve seen here: the folks who are most successful are the ones who want to know why, and they’re not satisfied until they can explain it in a simple way to someone else. That’s the key, right? The attribute of a true expert is someone who can explain something very difficult in a simple way. And I think that would be critical if you were joining Duckbill, or if you were building your own FinOps or cloud finance team. It is so complex; it’s the intersection of technical architecture and cost, and it touches almost the entire business. So, I think those are some other attributes that are just incredibly helpful.

Jesse: We’re also usually not entirely satisfied until we’ve either opened a support case with AWS, responded to one of their feedback icons in the AWS documentation—the public AWS documentation—or trolled somebody on Twitter saying, “Shame on you, AWS, for writing documentation that doesn’t make sense.”


Amy: It’ll be fine. Someone in your mentions will go, “Did you check the region?” And you would have, and then it’ll still be wrong.

Jesse: [laugh].


Amy: And it’ll be fine. [laugh]. Eventually, we’ll fix it.

Pete: That one—


Jesse: Too soon.

Pete: —that one still hurts, when we—oh, I’m just like, “Why do the numbers not line up?” And then someone was like—

Amy: It’s a thing I check for, even if it’s like, “It’s a global resource.” I don’t care. Just tell me. Just tell me it’s fine. [laugh].

Pete: “Are you in the right region?” Like—“Dammit, no, I’m not. Oh.” [laugh]. Yeah, that happens to the best of us.

Amy: I did, unfortunately, burn so many hours last week, I think it was, trying to find out where someone had put their resources. It’s like, “Oh, not us-west-2. It’s us-west-1. Of course.” [laugh].

Jesse: So annoying. Well, I would just like to say, Pete, it has been a joy and a pleasure working with you, and it has been a joy and a pleasure complaining about AWS with you on this podcast, so thank you for your time. That sounded really… really, really standoffish. I didn’t mean it quite as bad as it came off there. [laugh].

Pete: Well, you know, I think we need to thank Corey for having a child and thus needing to offload some of his podcast duties over to us, and then the fact that we just never gave him the podcast back, and we just took it over.

Jesse: Well, if you’ve got questions that you’d like us to answer, you can go to lastweekinaws.com/QA. And if you’ve enjoyed this podcast, please go to lastweekinaws.com/review and give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review, give it a five-star rating on your podcast platform of choice and tell us what qualities you’re looking for when building out your cloud finance team.

Pete: Thanks for coming in.


Announcer: This has been a HumblePod production. Stay humble.

Fri, 07 May 2021 03:00:00 -0700
Developer Portals are an Anti-Pattern

Want to give your ears a break and read this as an article? You’re looking for this link. https://www.lastweekinaws.com/blog/Developer-Portals-are-an-Anti-Pattern



Never miss an episode



Help the show



What's Corey up to?

Wed, 05 May 2021 03:00:00 -0700
Jack's Nimble Studio
AWS Morning Brief for the week of May 3, 2021 with Corey Quinn.
Mon, 03 May 2021 03:00:00 -0700
Listener Questions 5

Links:



Transcript

Corey: This episode is sponsored in part by LaunchDarkly. Take a look at what it takes to get your code into production. I’m going to just guess that it’s awful because it’s always awful. No one loves their deployment process. What if launching new features didn’t require you to do a full-on code and possibly infrastructure deploy? What if you could test on a small subset of users and then roll it back immediately if results aren’t what you expect? LaunchDarkly does exactly this. To learn more, visit launchdarkly.com and tell them Corey sent you, and watch for the wince.


Pete: Hello, and welcome to the AWS Morning Brief: Fridays From the Field. I am Pete Cheslock.

Jesse: I’m Jesse DeRose.


Pete: Wow, we’re back again. And guess what? We have even more questions. I am… I am… I don’t even know. I have so many emotions right now that are conflicting between a pandemic and non-pandemic that I just—I’m just so happy. I’m just so happy that you listen, all of you out there, all you wonderful humans out there are listening. But more importantly, you are going to lastweekinaws.com/QA and you’re sending us some really great questions.

Jesse: Yeah.

Pete: And we’re going to answer some more questions today. We’re having so much fun with this, that we’re just going to keep the good times rolling. So, if you also want to keep these good times rolling, send us your questions, and we’ll just—yeah, we’ll just roll with it. Right, Jesse?

Jesse: Absolutely. We’re happy to answer more questions on air, happy to let you pick our brains.

Pete: All right. Well, we got a couple more questions. Let’s kick it off, Jesse.

Jesse: Yeah. So, the first question today is from Barry. Thank you, Barry. “New friend of the pod here.” Always happy to have friends of the pod. Although I do feel like that starts to get, like, Children of the Corn, kind of. I think we started that, and I also am excited about it, and also upset with myself for starting that.

Pete: That’s all right. Friend of the pod. Friend of the pod.

Jesse: “New friend of the pod here. I work in strategic sourcing and procurement and I was curious if there are any ways that you recommend to get up to speed with managing cloud spend. This is usually closely monitored by finance or different groups in product, but I can see a significant potential value for a sourcing professional to help, also.” And that’s from Barry, thank you, Barry.

Pete: Well, I’m struggling not to laugh. “This is usually closely monitored by finance or different groups in product.”

Jesse: Yeah…

Pete: But I mean, let’s be honest, it’s not monitored by anyone. It’s just running up a meter in a taxi going 100 miles an hour.

Jesse: Yeah, that’s the hardest part. I want everybody to be involved in the cloud cost management practice, but there’s that same idea of if it’s everyone’s responsibility, it’s no one’s responsibility. And so this usually ends up at a point where you’ve got the CFO walking over to the head of engineering saying, “Why did the spend go up?” And that’s never a good conversation to have.

Pete: No, never a good one. Well, Barry because you’re a friend of the pod, we will answer this question for you. And honestly, I think it’s a great question, which is, we actually have been working with a lot of larger enterprises and these enterprises still have their classic sourcing and procurement teams. That’s not an expertise that is going away anytime soon, but like most teams within the company that are adopting cloud, it’s obviously going to evolve as people are moving away from, kind of, capital intensive purchases and into, honestly, more complex, multi-year OpEx style purchases, with cloud services and all the different vendors that come with it. It’s going to just get a lot harder.

I mean, it’s probably already a lot harder for those types of teams. And so there’s a bunch of places I think you can go that can help level up your skills around cloud spend. My history has been using Amazon’s cloud and being a person who cared about how much my company spent on it, but joining Duckbill, you need to dive into other areas of the FinOps world, and the first place I personally got to dive in a little bit more was the O’Reilly book, Cloud FinOps, which is actually a really great resource.

Yeah, I think it’s really well written and there’s a lot of great chapters within there that you can kind of pick and choose based on what you’re most interested in learning about. If you’re trying to learn more about unit economics, or you’re trying to learn more about how to monitor and track things like that, it’s a great book to dive into, and becomes a really great reference that you can leverage as you’re trying to level up this expertise within yourself or your team.

Jesse: It’s a really, really great resource. The other thing to think about is any kind of collaborative social spaces where you can be with like-minded individuals who also care about cloud costs. Now, there’s a number of meetups that exist under the FinOps title that may be worth looking into. Obviously, we’re recording this during the pandemic so I don’t recommend doing those in person. But as you are able to, there may be opportunities for in-person meetups and smaller local groups focusing on cloud cost management strategies together. But also check out the FinOps Foundation. They have a Slack space that I would love to tell you more about, but unfortunately, we’re not allowed to join. So—

Pete: Yep.

Jesse: —I can’t really say more about it than that. I would hope that you’re allowed to join, but they have some strict guidelines. So, I mean, the worst that can happen is they say no; it’s definitely worth signing up.

Pete: Yeah, and they have said no to us. [laugh].

Jesse: Yeah.

Pete: I think when you get into the FinOps Foundation, you should angrily say that we should have more FinOps experts in here, that the great Jesse DeRose should be a member, because right now, he’s just framed his rejection notice from them, and—

Jesse: Oh, yeah.

Pete: —while it looks beautiful on the wall, while I’m on a Zoom with him, I want more for you, Jesse.

Jesse: I want more for me, too. I’m not going to lie.

Pete: So, I don’t know, this might sound a little ridiculous, but I’m going to say something nice about AWS: they have a fantastic cost management blog. It’s a really incredible resource, with a lot more content recently. They seem to be doing some great work on the recruiting side and bringing on some fantastic experts around cost management.

I mean, just recently, within the past few months, they’ve talked about unit economics: how to select a unit metric that might support your business, and more about unit metrics in practice. They start at the basics, too. Obviously, we deal a lot in unit economics and unit metrics; they will start you off with something very basic and say, “Well, what even is this thing?” And they talk more about cost reporting using AWS Organizations for some of this. It’s a really fantastic resource.

It’s all free, too, which is—it’s weird to say that something from AWS is free. So, anytime that you can find a free resource from Amazon, I say, highly recommend it. But there are a lot of blogs on the AWS site, but again, the Cost Management Blog, great resource. I read it religiously; I think what they’re writing is some of, really, the best content on the blog in general.


Jesse: There’s one other book that I want to recommend called Mastering AWS Cost Optimization and we’ll throw links to all these in the [show notes 00:07:30], but I, unfortunately, have not read this book yet, so I can’t give strong recommendations for it, but it is very similar in style and vein to the Cloud FinOps book that we just mentioned, so might be another great resource to pick up to give you some spot learning of different components of the cloud cost management workflow and style.


Pete: Awesome. Yeah, definitely agree. I’d love to see, again, more content out here. There’s a lot of stuff that exists. And even A Cloud Guru has come up with cost management training sessions.


Again, we’d like to see more and more of this. I’d love to see more of this come from Amazon. You know, they have a certification path in all these different areas; I’d love to see more of that in the cost management world, because I think it’s going to become more complex, and there is so much knowledge spread so far across AWS that helping more people get up to speed on it will be critical for businesses that want to better understand what their spend is doing. So, really great question, Barry, friend of the pod. We should get some pins for that, right? Friend-of-the-pod pins?


Jesse: Oh, yeah.

Pete: And yeah, really great question. Really appreciate you sending it and hopefully that helps you. And if not, guess what? You can go to lastweekinaws.com/QA, and just ask us a follow-up question, Barry. Because you’re a friend of the pod. So, we’ll hopefully hear from you again soon.

Jesse: Thanks, Barry.


Pete: Thanks.

Announcer: If your mean time to WTF for a security alert is more than a minute, it’s time to look at Lacework. Lacework will help you get your security act together for everything from compliance service configurations to container app relationships, all without the need for PhDs in AWS to write the rules. If you’re building a secure business on AWS with compliance requirements, you don’t really have time to choose between antivirus or firewall companies to help you secure your stack. That’s why Lacework is built from the ground up for the cloud: Low effort, high visibility, and detection. To learn more, visit lacework.com.

Pete: All right, we have one more question. Jesse, what is it?


Jesse: “All right, most tech execs I speak with have already chosen a destination hyperscaler of choice. They ask me to take them there. I can either print out a map they can follow, procedural style, or I can be their Uber driver. I could be declarative. I prefer the latter for flexibility reasons, but having said that, where does one actually start?


Do you start with Infrastructure as a Service and some RDS to rid them of that pesky expensive Oracle bill? Do we start with a greenfield? I mean, having a massive legacy footprint, it takes a while to move things over, and integrating becomes a costly affair. There’s definitely a chicken and egg scenario here. How do I ultimately find the best path forward?” That question is from Marsellus Wallace? Thank you, Marsellus.

Pete: Great question. And I’m not just saying that. I guess I have a question. Or at least, maybe we have different answers based on what this really looks like. Is this a legacy data center migration?


The solution here is basically lift-and-shift. Do it quickly. And most importantly, don’t forget to refactor and clean up after you shut down your old data center. Don’t leave old technical debt behind. And, yeah, you’re going to spend a lot, you’re going to look at your bill and go, “Holy hell, what just happened here?”


But it’s not going to stay that way. That’s probably—if you do it right—the highest your bill is going to be, because lift-and-shift means basically just moving compute from one location to another. And as Jesse and I have spoken about probably a million times, if you just run everything on EC2 like a data center, it’s the most expensive way to do the cloud stuff. So, you’re going to then refactor and bring in ephemerality and tiering of data and all those fun things that we talk about. Now, is this a hybrid cloud world?

That’s a little bit different because that means you’re not technically going to get rid of, maybe, physical locations or physical data centers, so where do you start? It’s my personal opinion—and Jesse has his own opinion, too, and guess what, it’s our podcast and we’re going to tell it like it is.


Jesse: [laugh].


Pete: [laugh]. You know, my belief is, starting with storage is honestly a great way to get into cloud. Specifically S3. Maybe even your corporate file systems, using a tool like FSx. It’s honestly how many businesses start their cloud journey: by moving corporate email and file systems into the cloud.

I mean, as a former Microsoft Exchange administrator, I am thoroughly happy that you don’t have to manage that, really, anymore and you can push that in the cloud. So, I think storage is honestly a great way to get started within there: Get S3 going, move your file systems in there, move your email in there if you haven’t yet. That’s a really great way to do it. Now, the next one that I would move probably just as aggressively into and, Marsellus, you mentioned it: RDS, right? “Should we move into RDS, get rid of expensive Oracle bills?”

Yeah, anytime you can pay ol’ Uncle Larry less money is better in my mindset. Databases are, again, another really great way of getting into AWS. They work so well, RDS is just such a great service, but don’t forget about DMS, the database migration service. This is the most underrated cloud service that Amazon has in there, it will help you migrate your workloads into RDS, into Amazon Aurora. But one thing I do want to call out before you start migrating data in there, talk to your account manager—you have one even if you don’t think you have one—before starting anything, and have them help you identify if there are any current programs that exist to help you migrate that data in.

Again, Amazon will incentivize you to do it; they will provide you credits, like MAP (Migration Acceleration Program) credits or other investment credits, maybe even professional services that can help you migrate this data from an on-premises Oracle into AWS. I think you will be very pleasantly surprised with how aggressive they can be to help you get in there. The last thing that I would say is that another great thing to move in are data projects. So, let’s say you want to do a greenfield-type project in Amazon: data projects are a really great way to move in there. I’m talking things like EMR, Databricks, Qubole. You get to take advantage of Spot Fleets with EMR, but also Databricks and Qubole can manage Spot infrastructure and really take advantage of cloud ephemerality. So if, like I said, you started by pushing all your data into S3, you’re already halfway there on a really solid data engineering project, and now you get to leverage a lot of these other ancillary services like Glue, Glue DataBrew, Athena, Redshift.

I mean, once the data is in S3, you have a lot of flexibility. So, that’s my personal opinion on where to get started there. But Jesse, I know you always have a different take on these, so where do you think that they should start?
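[Editor’s note: as a rough illustration of the DMS workflow Pete describes, here’s a boto3 sketch that creates and starts a full-load-plus-CDC replication task. The ARNs are placeholders for a replication instance and source/target endpoints you’d create first; this is a sketch, not a complete migration runbook.]

```python
import boto3

dms = boto3.client("dms")

# Replicate everything (schema "%", table "%") from a source endpoint
# (say, on-premises Oracle) to a target endpoint (say, Aurora).
task = dms.create_replication_task(
    ReplicationTaskIdentifier="oracle-to-aurora",
    SourceEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:111122223333:rep:INSTANCE",
    MigrationType="full-load-and-cdc",  # initial copy, then ongoing changes
    TableMappings='{"rules": [{"rule-type": "selection", "rule-id": "1", '
                  '"rule-name": "1", "object-locator": {"schema-name": "%", '
                  '"table-name": "%"}, "rule-action": "include"}]}',
)
dms.start_replication_task(
    ReplicationTaskArn=task["ReplicationTask"]["ReplicationTaskArn"],
    StartReplicationTaskType="start-replication",
)
```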

Jesse: Yeah, I think all of the recommendations you just made are really, really great options. I always like to look at this from the perspective of the theory side or the strategy side. What ultimately do these tech execs want to accomplish? Is it getting out of data centers? Is it better cost visibility?

Is it optimizing spend? Is it the opportunity to move faster, to get new R&D capabilities that you can’t get in a data center? What do these tech execs ultimately want to accomplish? Ask them. Start by asking them.

Prioritize the work that they want to accomplish first, and work with teams to change their behaviors to accomplish their goals. One of the biggest themes that we see in the space moving from data centers into cloud providers or even just growing within a given cloud provider is cost visibility. Do teams know why their spend is what it is? Do they know why it went up or down month-over-month? Can they tell you the influences and the drivers that cause their spend to go up or down?

Can they specifically call out which teams or product usage increased or decreased, and what ultimately led to your spending changing? Make sure that every team has an architecture diagram and they can explain how they use AWS, how data moves from one service to another, both within their product and to other products. Because there’s definitely going to be sharp edges with data transfer between accounts. We’ve seen this happen to a number of clients before; I’ve gotten bit by this bullet. So, talk to your teams, or talk to your tech executives and have those tech executives talk to their teams to understand what do they ultimately want to accomplish?


Can they tie all of what they’re trying to accomplish back to business metrics? Maybe a spike in user logins generated more usage? If you’re a photo storage company, did a world event prompt a lot of users to upload photos prompting higher storage costs? Are you able to pull out these specific insights? That’s ultimately the big question here. Can you boil it down to a business KPI that changed, that ultimately impacted your AWS spend?
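[Editor’s note: the unit-economics question here boils down to arithmetic like this; the numbers below are made up purely to show the shape of it.]

```python
# Hypothetical photo-storage company: monthly AWS spend alongside the
# business metric that drives it. Divide one by the other and you have
# a unit metric you can track, and defend to a CFO.
monthly_spend = {"2021-02": 98_000, "2021-03": 112_000, "2021-04": 131_000}
photos_stored = {"2021-02": 41_000_000, "2021-03": 47_000_000,
                 "2021-04": 60_000_000}

for month, spend in monthly_spend.items():
    unit_cost = spend / photos_stored[month]
    print(f"{month}: ${spend:,} total, ${unit_cost:.5f} per photo stored")
```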


Pete: I think this is a scenario of, where do you get started? Why not both? Just maybe do both of these things that we’re saying.

Jesse: Yeah.


Pete: And honestly, I think you’ll end up in a pretty great place. So, let us know how that works out, Marsellus, and thank you for the question. Again, you also can send us your questions, and we will maybe answer these on a future episode; lastweekinaws.com/QA, drop a question in there, put your name, or not, or a fake name, or even a joke. That’s fine, too. I don’t know what the text limit is on the name, Jesse. Can you put a joke there? I don’t know. You know what? Test that out for us. It’s not slash QA for nothing. So, give that a little QA, or a question and answer, and [unintelligible 00:17:29]. All right. Well, thanks, Jesse, for helping me out answering more questions.

Jesse: Thanks, everybody for the awesome questions.


Pete: If you enjoyed this podcast, please go to lastweekinaws.com/review, give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review and give it a five-star rating on your podcast platform of choice and tell us, what would be the last thing that you would move to AWS? It’s QuickSight, isn’t it?

Jesse: [laugh].


Pete: Thanks, everyone. Bye-bye.


Announcer: This has been a HumblePod production. Stay humble.

Fri, 30 Apr 2021 03:00:00 -0700
"The Sun Also Crashes: Keeping Current"

Want to give your ears a break and read this as an article? You’re looking for this link.



Never miss an episode



Help the show



What's Corey up to?

Wed, 28 Apr 2021 03:00:00 -0700
DynamoDB Streams for DynamoDB Streams
AWS Morning Brief for the week of April 26, 2021 with Corey Quinn.
Mon, 26 Apr 2021 03:00:00 -0700
Listener Questions 4

Links:


Transcript

Corey: This episode is sponsored in part by LaunchDarkly. Take a look at what it takes to get your code into production. I’m going to just guess that it’s awful because it’s always awful. No one loves their deployment process. What if launching new features didn’t require you to do a full-on code and possibly infrastructure deploy? What if you could test on a small subset of users and then roll it back immediately if results aren’t what you expect? LaunchDarkly does exactly this. To learn more, visit launchdarkly.com and tell them Corey sent you, and watch for the wince.


Pete: Hello, and welcome to the AWS Morning Brief: Fridays From the Field. We’re back again, my name is Pete Cheslock.

Jesse: I’m Jesse DeRose. So, happy to be back in the studio after our whirlwind tour of the Unconventional Guide that I feel like we’ve been on for roughly as long as the pandemic’s been going on at this point; probably a little bit less. But lots of really great content there that we were happy to talk about, and I’m happy to be moving on to some other topics.

Pete: Yeah, absolutely. And the topics, we actually get to move on to some of our favorite topics, which are answering your questions. And it turns out, Jesse, there’s more than two people that listen to us. There’s a lot of you; there are dozens of you out there, and we love it.

Jesse: You like me. You really like me.


Pete: So, great. So great to see. We’ve been getting tons of fantastic questions, a few of which we’re going to answer right now. You can also have your question answered by going over to lastweekinaws.com/QA and entering your question there. You can enter your name, or you can leave it blank, or you could just put something funny there. Anything works. We’re happy to dive in deeper on any particular topic, again, whether it’s about this recent Unconventional Guide series or just something you’re curious about in your day-to-day cost management life.


Jesse: Today’s questions are really great because they ultimately get at the practical side of all of our recommendations. Because I feel like every single time I subscribe to one of those self-help books or blogs and I read all these really great short, sweet tidbits, I think to myself, “This is perfect. I’ll go apply this to everything in my life.” But then doing the actual work part is so much harder. Where do you even start with that first step once you’ve got the big picture grand idea? So, today we’ve got some really, really great questions, focusing on the best ways to get started on your cloud cost management journey. So, let’s start off with these questions.



First question is, “Could you cover some practical approaches to applying some of your Cost Management Guide? A lot of your suggestions sound simple on paper, but in practice, they become quite complicated.” So, true. Absolutely, absolutely a concern. “I’ve had some success pulling in a small group of subject matter experts together for short periods of time focusing on low risk, easy things to do. How have you approached actually doing this? What meetings do you set up? What do you take for notes? How do you document your savings? How do you find new opportunities?” That’s from Brian O. Brian O., That’s a really, really great question.

The other one that I want to add to this: “We’re a big AWS shop, and I’ve spent some time inside the AWS beast in the past, and I still struggle with multi-account, multi-region data transfer in general, but specifically with analyzing cost and usage. For example: if data transfer out went up $25,000 last month, how do you attribute that? How do you know where to apply that? How do you know what ultimately prompted that spend? Love how you work through these types of challenges. What is relatively easy at a single account level gets exponentially more complex with every account and region we function in.” So, true. And that’s from Todd. Thank you, Todd. In both cases, absolutely true.

There’s this really great idea here: we can give you the short and sweet things to think about, but taking those first steps to practically apply these ideas is tough, and it needs to scale over time. And not every practice does.

Pete: Yeah, these are great questions. I, kind of, am remembering that meme that was around for a while, which was, how to draw an owl. “First, draw two circles, and then, you know, you draw the rest of the owl.”

Jesse: Yeah.


Pete: And honestly, oftentimes, even some of the stuff that we say, Jesse, feels that way, and it’s not intended to come across that way. It’s just, we could bore you all with a multi-hour-long recording on some of these topics. I mean, we do this with our clients, and our clients pay for this pleasure [laugh] for us to put them to sleep with our soft tones of the cloud cost management world. But I think the reality is that it is complex and there are unlikely to be quick wins in a lot of these places. One thing that we found is honestly, monitoring, visibility—I think all the cool kids are calling it observability now—

Jesse: [laugh].


Pete: —you know, I can’t believe I’m going to say this, but CloudWatch is actually probably one of the best cloud cost reduction tools that exist out there. There are so many services within AWS that you’re probably using today that, by default, report data to CloudWatch. And those statistics are potentially a huge place to identify resources that are over-provisioned and underused, idle resources, things like that. I can’t tell you how many times I will go into a client account, and one of the first places I go—after Cost Explorer—is probably CloudWatch. So, monitoring spend and monitoring what’s happening there is kind of a great way to get started on that cloud cost idea, because you’re getting charged for everything that happens, so knowing what’s happening, and knowing how it’s changing over time, is a great way to start understanding and reducing it.
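[Editor’s note: here’s a minimal sketch of that idea with boto3: walk your running EC2 instances and flag the ones whose CPU, per CloudWatch, has barely moved in two weeks. The threshold and window are arbitrary placeholders.]

```python
from datetime import datetime, timedelta

import boto3

ec2 = boto3.client("ec2")
cw = boto3.client("cloudwatch")
now = datetime.utcnow()

# Flag running instances whose daily average CPU never broke 5% in the
# past two weeks: prime candidates for downsizing or turning off.
reservations = ec2.describe_instances(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
)["Reservations"]

for res in reservations:
    for inst in res["Instances"]:
        datapoints = cw.get_metric_statistics(
            Namespace="AWS/EC2",
            MetricName="CPUUtilization",
            Dimensions=[{"Name": "InstanceId", "Value": inst["InstanceId"]}],
            StartTime=now - timedelta(days=14),
            EndTime=now,
            Period=86400,  # one datapoint per day
            Statistics=["Average"],
        )["Datapoints"]
        if datapoints and max(dp["Average"] for dp in datapoints) < 5.0:
            print("idle?", inst["InstanceId"], inst["InstanceType"])
```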

Jesse: Yeah. And I think AWS is probably also using some of those CloudWatch metrics in the optimization recommendations that they make within their own optimization tooling. It’s probably just not clearly defined or clearly outlined for AWS customers to be able to use the same metrics. So, I feel like if my Compute Optimizer could quickly load or link to a graph that showed me low CPU utilization across a number of instances, that would be a really handy way for me to start using more of CloudWatch’s metrics.

Pete: Yeah, I think Compute Optimizer is, honestly, criminally underused out there. I don’t know why. One of the common complaints is, “Well, you can’t get memory statistics unless you have the CloudWatch Agent.” Yes. So, install the CloudWatch Agent; have it report up the one or two memory metrics that Compute Optimizer needs to make a recommendation, and the cost will more than pay for itself.


And now you can even output those statistics to S3 and do some fun programmatic stuff with it. Put those outputs in front of the engineers that own those resources and be like, “Hey, yo. This thing says, change your i3.24xl. Could you move it to something a little bit more useful, like a t3.small?”
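[Editor’s note: and a rough sketch of pulling those Compute Optimizer findings programmatically, so you can put them in front of the engineers who own the resources. It assumes Compute Optimizer is already opted in for the account.]

```python
import boto3

co = boto3.client("compute-optimizer")

# List rightsizing findings; over-provisioned instances come with
# concrete cheaper instance-type recommendations attached.
resp = co.get_ec2_instance_recommendations()
for rec in resp["instanceRecommendations"]:
    if rec["finding"].lower() == "overprovisioned":
        options = rec.get("recommendationOptions", [])
        suggestion = options[0]["instanceType"] if options else "n/a"
        print(rec["instanceArn"], rec["currentInstanceType"], "->", suggestion)
```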


Jesse: And these are just some practical applications for some of the specific metrics we’re talking about, but this is a practice that you might want to turn into a process, you might want to turn into an ongoing amount of work. And in a lot of cases, we’ve seen this start as one engineer who’s really interested in understanding AWS, really passionate about the bill, maybe isn’t in a leadership or management role so maybe they don’t have a direct business requirement to optimize their spend, but they’re really, really interested in this work and they grow into a role where they are taking on more and more of this work. And that’s not scalable; that engineer is going to get burned out very, very quickly if they have a day-to-day role that is focused in development and doing all of this optimization work, cost cloud management work on the side. We generally recommend at least one dedicated individual who starts building these dashboards, starts looking at some of these metrics, starts these conversations with teams, and ultimately grow that into a full team.

Pete: Right. I think the biggest thing that we’re seeing in the industry is actual cloud finance teams coming into existence at companies. It’s such a critical role, and it’s sad to see when people are like, “Arg, spend is out of control. We’re doubling year over year on spend and no one really seems to know why.” And honestly, it’s because no one cares about it. There’s no ownership of it. And, you know, we see it a lot, right? It’s like, “Well, everyone owns the Amazon bill.” That’s code for, “No one owns the Amazon bill.”

Jesse: Yeah.

Pete: But these cloud finance teams, and even the term cloud economist, as silly as it is, are centered in reality, which is: we create financial models to understand spend, and we dive into those numbers to make the usage make sense to folks like CFOs inside of companies. Yeah, there are a couple of ways that we have seen some of this done at scale. One is active monitoring: actively monitoring the spend against really granular budgets, and reporting it as such. So, maybe you’re breaking these budgets out to be product-specific, or team-specific, or by business unit, or things like that, and then reaching out to these engineering teams. Because you are actively monitoring the spend on a recurring basis, you can reach out to those teams when their spend goes over a given threshold that you’ve put in place, or when you, maybe, find some optimization opportunities.

You’re probably thinking to yourself, “Wow, I don’t have the time for that.” Yeah, but you need to create the time or you need to create the team for this. The companies that we work with who have a dedicated team around this are the ones that do the best. In some cases, we’ve seen that having a dedicated cloud finance team causes the bill to actually decrease over time, which—you’re thinking to yourself, “Wow. An Amazon bill that goes down? We so rarely see that.”


Even for us, our clients come to us, and we help them find optimizations. They’ll make those optimizations, but then they replace that spend with other investments. Usually, it’s new projects and new spend. But actually seeing the bill go down because of a dedicated effort by a team is still, again, amazing to see. The other approach we’ve seen is more passive monitoring, running in the background, where you have a cloud platform team that provides abstractions and guardrails to the user.

So, you’re not really trying to actively stand in the way of users and what they’re able to do, or reaching out to them in an ongoing way, but you’re abstracting away the complexity of the cloud and letting them basically live in a safe space that you are controlling for them. And that’s another way that you can build in some of this cloud financial knowledge, where teams can get that visibility into what they’re spending and ask, is this too high? Is it going out of a boundary? Is there a number that I need to keep inside of? That level of visibility into cost, and into a team’s actual charges, gets people to start thinking, “Well, hold on a second. We’re above budget.” Even though maybe it’s not a real budget, “We’re above a spend by 20%. We need to bring that down.” And you give them the tools they need, and the dashboards, to effect that change on their own.

Jesse: This idea of passive monitoring is really all about making the right thing the easy thing to do. If you, as a member of the cloud platform team or as a member of leadership who cares about cloud spend, want to make sure that teams are managing their spend in some capacity, maybe not actively or directly, make sure that there are guide rails in place that keep them within the boundaries of what you ultimately want them to work on. This makes it a lot easier for them not to spin up an i3 instance that they don’t ultimately need; it makes it a lot easier for them not to deploy resources that are missing tags. Put in place as many guardrails as you can that keep teams independently able to work within the space where they are building, developing, and functioning, but that ultimately give them the opportunity to continue being independent and really thrive within whatever work they’re doing.

Pete: Yeah, the next thing that we recommend to everyone, and actually, we recommend it before even engaging with The Duckbill Group (you’ll get an onboarding document of things to do), is to turn on the Cost and Usage Report. If you’re listening to this and you’re like, “What’s the Cost and Usage Report?” Well, boy, are you in for some fun learning, because it is a highly granular usage report of everything that you’ve ever done within Amazon, and it’s extremely powerful. The downside is that it can be hard to navigate; it takes a little time to learn.

But go turn it on; the cost is minimal; it’s just the cost of storing the data in S3. Preferably, when you turn this on, turn it on in Parquet format, because that’ll allow you to query it with tools like Athena, or Tableau, or Looker, or—God forbid—SageMaker. And this Cost and Usage Report lets you dive in at an extremely granular level, down to per-hour, per-resource visibility. That resource-level usage is something you have to enable, but again, I highly recommend enabling it. Because now you can go and find out, well, for SageMaker I’m seeing a growth in spend.


Well, which resource is it within SageMaker? You can break that down really granularly. So, Cost and Usage Report is another place that, again, if you’re not using this today, if you don’t have at least a SageMaker dashboard, which costs basically nothing—a couple of dollars a month—pointed at your Cost and Usage Report, you’re missing out on some really great ways to understand the changes in spend over time.
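[Editor’s note: once the CUR is landing in S3 as Parquet, a query like this is the payoff. It assumes you’ve pointed a Glue/Athena table named cur at the export; the database name, output bucket, and partition values below are placeholders.]

```python
import boto3

athena = boto3.client("athena")

# Which resources drove last month's SageMaker spend? Requires
# resource-level detail enabled on the CUR.
query = """
SELECT line_item_resource_id,
       SUM(line_item_unblended_cost) AS cost
FROM cur
WHERE line_item_product_code = 'AmazonSageMaker'
  AND year = '2021' AND month = '4'
GROUP BY line_item_resource_id
ORDER BY cost DESC
LIMIT 20;
"""
athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "cur_database"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```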

Announcer: If your mean time to WTF for a security alert is more than a minute, it’s time to look at Lacework. Lacework will help you get your security act together for everything from compliance service configurations to container app relationships, all without the need for PhDs in AWS to write the rules. If you’re building a secure business on AWS with compliance requirements, you don’t really have time to choose between antivirus or firewall companies to help you secure your stack. That’s why Lacework is built from the ground up for the Cloud: low effort, high visibility, and detection. To learn more, visit lacework.com.

Jesse: Another couple of really great options are the AWS Cost Anomaly Detection service and AWS Budgets. Both are free, which is absolutely fantastic. I highly recommend checking them out. AWS Cost Anomaly Detection, once enabled, will actually look for anomalies in your spend across different AWS services, across different cost allocation tags, across different cost categories. There are a lot of opportunities here for you to see anomalous spend and act on it.

This can be shared with teams as soon as the anomaly occurs, through Slack notification or an email, or maybe you get email notifications on a weekly basis, or a monthly basis, or some kind of recurring basis, for all of the anomalies that you saw within a given time period. We recorded an episode about Cost Anomaly Detection a while back; highly recommend checking that episode out. It’s got a lot of really great features and recommendations for getting started.

The other one I mentioned is AWS Budgets. Again, if you’re not really sure where to start, try creating some budgets for your teams. Maybe look at the last six months of spend for each team, maybe look at spend across different tags, or team units, or business units, whatever makes the most sense for the way that your organization is set up, and create some budgets for those groups. These budgets could be for specific AWS services if you are a single team running within a single AWS account, it could be as complex as multiple business units across multiple different accounts across different parts of the organization. There’s lots of great opportunities here for you to start to better understand your spend, get better visibility into your cloud spend.
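[Editor’s note: both services are scriptable, too. A minimal sketch: list a month’s anomalies over a $100 impact, then create a monthly budget scoped to one team’s cost allocation tag. The account ID, dates, tag, and dollar amounts are placeholders.]

```python
import boto3

ce = boto3.client("ce")
budgets = boto3.client("budgets")

# Recent cost anomalies with a total impact over $100.
anomalies = ce.get_anomalies(
    DateInterval={"StartDate": "2021-04-01", "EndDate": "2021-04-30"},
    TotalImpact={"NumericOperator": "GREATER_THAN", "StartValue": 100.0},
)
for anomaly in anomalies["Anomalies"]:
    print(anomaly["AnomalyId"], anomaly["Impact"]["TotalImpact"])

# A monthly cost budget for one team, filtered by its "team" cost tag
# (budgets use the "user:key$value" filter format).
budgets.create_budget(
    AccountId="111122223333",
    Budget={
        "BudgetName": "team-data-eng-monthly",
        "BudgetLimit": {"Amount": "20000", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
        "CostFilters": {"TagKeyValue": ["user:team$data-eng"]},
    },
)
```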

Pete: Yeah, absolutely. I think all of those are great tools that can really help you. And, Jesse, I know we’ve talked about this before: even just monitoring your tagging. Not, “Oh, are we tagging 50% of our resources?” You want to monitor your untagged resources by spend. So, if 95% of your spend is tagged, you’re crushing it. That’s amazing. But that may only be 50% of your things.

So, I guess, care less about how many of your resources are tagged—because some of them just can’t be tagged, or are tagged in a painful way—but focus more on the money aspect of it. And that will lead you into the ability to start creating some governance strategies. And that term governance, it just—


Jesse: Oof.

Pete: —makes me feel gross. Yeah. Oh god, terrible word. But the [laugh] sad state of the world is, that’s what most companies we talk to need; they just don’t have it. When the companies that we talk to who are like, “Our spend is going up, and we’re not sure why.” Or, “How do we get our engineers to care about cost savings?” And things like that. You know, having a governance strategy, a way to react to those changes in spend in a, hopefully, automated way, is critical to helping control that spend.

Jesse: This really gets to the heart of why cloud cost management is important. It could be important for different reasons for different parts of the organization. Account structure, tagging: all of these things can be important to different parts of the organization for different reasons. And that’s fine. The important thing is to socialize those reasons to all the different parts of the company so that everybody understands what’s at stake.

Everybody understands how they can collaborate and create these best practices together. This really dives into the idea of behaviors and systems. I know it sounds a little outside the vein of engineering work, and finance, and cloud cost management, but what kind of behaviors do you ultimately want to see within your teams? What kind of actions do you want to see your engineers taking? Do you want them to start thinking about cost in all of their architecture discussions?

Do you want them to review the budgets that you’ve created for them every month? Every week? During stand-up meetings? What kind of things do you ultimately want to see them doing on a regular basis that maybe they aren’t doing right now, that maybe would ultimately help the company succeed with all of this cloud cost management work that you’re creating? And again, going back to the idea of making the right thing the easy thing to do, how can you improve the existing technical systems that you have within your organization to make the right thing the easy thing to do?


How can you change your CI/CD pipelines? How can you change the tools that you’re using for cost visibility, like Looker, or Tableau, or SageMaker, or something else, such that teams can quickly and easily self-service the information that they need to make their decisions to go about their days, go about their work more easily?

Pete: So, Jesse, you’re saying that it’s a mixture of software and culture? Kind of sounds like DevOps a little bit, doesn’t it? [laugh].

Jesse: Yeah. Yeah, it kind of does.


Pete: Yeah, it kind of does. So, you know, I think all of that is to say, it’s hard work, it’s not going to come easy, but how would we get started? Like, when we enter into an engagement with one of our clients, we’re coming in from total outsiders and we’re trying to navigate through a company with complex communication structures, and maybe teams that are entrenched in different ways. How do we get started? Well, we dive in; we start with big numbers, right?


What are your top ten places your money goes, just by service? I’ll answer it for you. It’s probably EC2, S3, RDS, and then dealer’s choice for the last ones, maybe data transfer, maybe Lambda, if you’re really weird. And if Lambda is in your top five, you should absolutely give us a call because that should not be the case. [laugh].
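If you want to pull that top-services list programmatically rather than clicking through Cost Explorer, a sketch like this works; the date range is illustrative:

```python
# Sketch: answer "where do the top ten dollars go?" by grouping last
# month's Cost Explorer data by service and sorting descending.
import boto3

ce = boto3.client("ce")
resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2021-03-01", "End": "2021-04-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

groups = resp["ResultsByTime"][0]["Groups"]
groups.sort(key=lambda g: float(g["Metrics"]["UnblendedCost"]["Amount"]), reverse=True)
for g in groups[:10]:
    print(g["Keys"][0], g["Metrics"]["UnblendedCost"]["Amount"])
```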


But start with those big numbers; understand where the money is going. But then go to the next level in. Okay, within EC2, where is your spend going? Or the dastardly EC2 ‘other’ cost category: okay, where’s the money going? Is it in regional data transfer, which is also what’s called cross-AZ data transfer? Is it in your NAT gateways? Why?


That’s the next question. Why is the spend high in that area? You may not be able to understand because it may not be tagged—we find that a lot—but start asking questions. And that’s what we do: We start reaching out to technical folks within the company. We’d say, “Hey, we see you’ve got a high amount of usage on EMR, but they’re all clusters that are running 24/7. They’re not scaling up and down as the jobs are happening. Who knows more about EMR?” And we just start asking questions. And we’re asking them, “Well, are you doing anything on the cost optimization side? Have you tried to do anything cost optimization-wise to reduce it, and you haven’t been able to? How does this infrastructure scale? Does it scale linearly with the number of users? Does it scale in a different way? Who are the consumers?”
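For the EC2 ‘other’ drill-down Pete mentions, here's a hedged sketch of the same kind of Cost Explorer call, filtered to that service and grouped by usage type, which is where line items like NAT gateway hours and cross-AZ data transfer tend to surface:

```python
# Sketch: break down the "EC2 - Other" bucket by usage type.
# ("EC2 - Other" is how Cost Explorer labels this service grouping.)
import boto3

ce = boto3.client("ce")
resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2021-03-01", "End": "2021-04-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    Filter={"Dimensions": {"Key": "SERVICE", "Values": ["EC2 - Other"]}},
    GroupBy=[{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
)
for g in resp["ResultsByTime"][0]["Groups"]:
    print(g["Keys"][0], g["Metrics"]["UnblendedCost"]["Amount"])
```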

And then you kind of even go another level down to see, do you find anything that just looks odd? I saw on one account for a client we were working with, VPC costs were just extremely high, much higher than I’ve ever seen before. What was interesting is that the cost was not data-transfer related; it was the pure number of endpoints that they had created. That cost far outweighed any of the data transfer costs; there was just a piece of technical debt that they were aware of, but given the structure of their multi-account setup, they just couldn’t do anything about it. But again, you’re looking for things like that. And you know that you are doing a good job if you can get to the end of this process—which could take months, or even years, depending on your scale—and answer this question: if the customers, users, or consumers of the applications on your cloud service increased by 200%, 500%, or 1,000%, what would happen to your cloud spend? How would it change? That’s the end game you’re trying to get to. That’s the unit economics, the unit economic model and forecasting, and now you’re a superhero because now you can answer a question that not a lot of people are able to answer about their cloud usage.

Jesse: I also want to add that, as you’re asking questions, you’re going to find teams that specifically will tell you, “We created this infrastructure in this way because security told us to,” or, “Because our business requirements say that we have an SLA that means we need to keep data for this amount of time at this level of availability.” And that’s totally fine. That doesn’t mean that you need to necessarily change those requirements. But now you might have a dollar amount for those business decisions. Now, you might ultimately be able to say, okay, our product SLA may say that we need to keep data for 90 days, but keeping data for 90 days, that business decision is costing us hundreds of thousands of dollars every month because of the sheer volume of data that we now have to keep. Is that something that we ultimately are okay with? And are we okay spending that much money every month to keep this business decision, or do we need to revisit that business decision? And that’s only something that you and your teams can decide for yourselves.
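To make that kind of dollar figure concrete, here's a back-of-the-envelope sketch with entirely made-up numbers; the ingest volume and the S3 Standard rate are illustrative assumptions, not figures from the show:

```python
# Sketch: pricing out a 90-day retention requirement with invented numbers.
ingest_gb_per_day = 22_000   # hypothetical ingest volume
retention_days = 90          # from the (hypothetical) product SLA
standard_rate = 0.023        # approx. USD per GB-month, S3 Standard, us-east-1

stored_gb = ingest_gb_per_day * retention_days   # steady-state data on hand
monthly_cost = stored_gb * standard_rate
print(f"~{stored_gb / 1e6:.2f} PB retained, ~${monthly_cost:,.0f}/month")
```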

Pete: Awesome. These are great questions. You can also send us a question at lastweekinaws.com/QA. We would love to spend some time diving into it and just helping you out, helping you get through your day. That’s what we’re here for.

If you enjoyed this podcast, please go to lastweekinaws.com/review and give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review. Give it a five-star review on your podcast platform of choice, and tell us what is your favorite EC2 instance to turn off for your engineers.

Jesse: [laugh].


Pete: Thanks, everyone.


Announcer: This has been a HumblePod production. Stay humble.

Fri, 23 Apr 2021 03:00:00 -0700
S3's Durability Guarantees Aren't What You Think

Want to give your ears a break and read this as an article? You’re looking for this link.



Never miss an episode



Help the show



What's Corey up to?

Wed, 21 Apr 2021 03:00:00 -0700
AOS Engineering
AWS Morning Brief for the week of April 19, 2021 with Corey Quinn.
Mon, 19 Apr 2021 03:00:00 -0700
Listener Questions 3 - How to Get Rid of Your Oracle Addiction

Links:

Transcript

Corey: This episode is sponsored in part by LaunchDarkly. Take a look at what it takes to get your code into production. I’m going to just guess that it’s awful because it’s always awful. No one loves their deployment process. What if launching new features didn’t require you to do a full-on code and possibly infrastructure deploy? What if you could test on a small subset of users and then roll it back immediately if results aren’t what you expect? LaunchDarkly does exactly this. To learn more, visit launchdarkly.com and tell them Corey sent you, and watch for the wince.



Pete: Hello, and welcome to the AWS Morning Brief: Fridays From the Field. I am Pete Cheslock.


Jesse: I’m Jesse DeRose.


Pete: We’re coming at you again with some more listener questions from the Unconventional Guide to AWS Cost Management. I’m excited. People are listening to us, Jesse.


Jesse: This is fantastic. I’m really excited that we have one fan. I’ve always wanted one fan.


Pete: Well, two fans now. Maybe even more, because we keep getting questions. And you can also be one of our Friends of the Pod by going to lastweekinaws.com/QA. You can give us some feedback, you can give us a question and, like, we’ll totally answer it because we like Friends of the Pod.


Jesse: We may or may not enter you into a raffle to get a Members Only jacket that’s branded with ‘Friends of the Pod.’


Pete: We should get some pins made, maybe.


Jesse: Ohh…


Pete: I think that's a good idea.


Jesse: Yeah.


Pete: So, what are we answering today, or attempting to answer for our listener, Jesse?


Jesse: So today, we’ve got a really great question from [Godwin 00:01:20]. Thank you, Godwin. Godwin writes, “I truly believe that the system that I support is, like, a data hoarder. We do a lot of data ingestion, we recently did a lift-and-shift of the system to AWS, and we use an Oracle database. The question is, how do I segregate the data and start thinking about moving it out of traditional relational databases and into other types of databases? Presently, our method is all types of data go into a quote-unquote, ‘all-purpose database,’ and the database is growing quite fast. Where should I get started?”

Pete: Well, I just want to commend you for a lift-and-shift into Amazon. That’s a Herculean feat, no matter what you’re lifting and shifting over. Hopefully, you have maybe started to decommission those original data centers and you don’t just have more data in twice as many locations.

Jesse: [laugh]. But I also want to call out well done for thinking about not just the lift-and-shift, but the next step. I feel like that’s the thing that a lot of people forget about. They think about the lift-and-shift, and then they go, “Awesome. We’re hybrid. We’re in AWS, now. We’re in our data center. We’re good. Case closed.” And they forget that there’s a lot more work to do to modernize all those workloads in AWS, once you’ve lifted and shifted. And this is part of that conversation.

Pete: Yeah, that’s a really good point because I know we’ve talked about this in the past, the lift-and-shift shot clock: if you don’t start migrating and modernizing those applications to take advantage of things that are more cloud-native, the technical debt is really going to start piling up, the folks that are managing it are going to get more burnt out, and it really is going to end poorly. So, the fact you’re starting to think about this now is a great thing. Also, what is available to you now that you’re on AWS is huge compared to a traditional data center.


Jesse: Yeah.


Pete: And that’s not just talking about the—I don’t even know if I’ve ever counted how many different databases exist on Amazon. I mean, they have a database for, at this point, every type of data. I mean, is there a type of data that they’re going to create, just so that they can create a database to put it into?


Jesse: Wouldn’t surprise me at this point.


Pete: They’ll find a way [laugh] to come up with that charge on your bill. But when it comes to Oracle, specifically Oracle databases, there’s obviously a big problem in not only the cost of the engine, running the database on RDS or something to that effect, but you have licensing costs that are added into it as well. Maybe you have a bring-your-own-license arrangement, or maybe you’re just using the off-the-shelf, kind of, ‘retail on-demand pricing’ RDS—I’m using air quotes for all these things, but you can’t see that—which will just have the licensing costs baked in as well. So, you’re paying for it—kind of—either way.


Jesse: And I think this is something also to think about that we’ll dive into in a minute, but one of the things that a lot of people forget about when they move into AWS is that you’re not just paying for data sitting on a depreciating piece of hardware in a data center anymore. You’re paying for storage, you’re paying for I/O costs, you’re paying for data transfer, and, to Pete’s point, you’re also potentially paying for some of the license as well. So, there are lots of different costs associated with keeping an Oracle database running in AWS. So, that’s actually probably the best place to start thinking about this next step, about where to get started. Think about the usage patterns of your data.


And this may be something where you need to involve engineering, and maybe involve product if they’re part of these conversations about storage for your product or your feature sets. Think about it: what are the usage patterns of your data?


Pete: Yeah, exactly. Now, you may say to yourself, “Well, we’re on Oracle”—and I’m sure people listening are like, “Well, that’s your problem. You should just move off of Oracle.” But you can’t go back in time and undo that decision—and the reality is, it probably was a good decision at the time. There are a lot of businesses, including Amazon, who ran all of their systems on Oracle.


And then migrated off of them. Understanding the usage patterns, what type of data is going into Oracle, I think is a big one. Because if you can understand the access patterns of the types of data that are going in, that can help you start peeling off where that data should go. Now, let’s say you’re just pushing in all newly created data. And we don’t even know what your data is, so we’re going to make some wild assumptions here on what you could possibly do—but more so just giving you homework, really—thinking about the type of data going in, right?


If you’re just—“I’m pushing all of my data into this database because someday we might need to query it”—that’s actually a situation where you really want to start thinking of leveraging more of a data warehouse-style approach, where you have a large amount of data being created, you don’t know if you’re going to need to query it in the future, but you might want to glean some value out of it. Using S3, which is now available to you outside of your data center world, is going to be super valuable: you can very cheaply shove data into S3 and go back to it at a later time. And then you can use things like Athena to ad hoc query that data, or leverage a lot of the ingestion services that exist to suck that data into other databases. But thinking about what’s being created, and when it is going into places, is a big first step to start understanding, well, how quickly does this data need to come back?


Can the query be measured in many seconds? Can it be done ad hoc, like in Athena? Does it need to be measured in milliseconds? What’s the replication that needs to happen? Is this very valuable data that we need to have multiple backups on?

Is it queried more than it’s created? Maybe you need to have multiple replica reader databases that are there. So, all these types of things of really understanding just what’s there to begin with, and it’s probably going to be in talking to a lot of engineering teams.
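As a sketch of the S3-plus-Athena pattern Pete describes, here's what an ad hoc query might look like with boto3; the bucket, database, and table names are all hypothetical:

```python
# Sketch: ad hoc querying data parked in S3 with Athena.
import boto3

athena = boto3.client("athena")
resp = athena.start_query_execution(
    QueryString=(
        "SELECT event_type, count(*) FROM events_db.raw_events "
        "WHERE ingest_date = DATE '2021-04-01' GROUP BY event_type"
    ),
    QueryExecutionContext={"Database": "events_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print(resp["QueryExecutionId"])  # poll get_query_execution until it completes
```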


Jesse: Yeah, you can think about this project in the same way that you might move from a monolith to a microservice architecture. So, if you’re moving from a monolith to a microservice architecture, you might start peeling away pieces of the monolith, one at a time. Pieces that can easily be turned into microservices that stand on their own within the cloud, even if they’re running on the same underlying infrastructure as the monolith itself within AWS. And then, as you can pull those pieces away, then start thinking about does this need to be in a relational database? Does this need to have the same amount of uptime and availability as the resources that are sitting in my Oracle Database right now?


All those things that Pete just mentioned, start thinking about all of those components to figure out where best to pull off the individual components of data, and ultimately put them in different places within AWS. And to be clear, there’s lots of great guides on the internet that talk about moving from your Oracle database into, gosh, just about any database of choice. AWS even has specific instructions for this, and we’ll throw a link in the [show notes 00:09:02].


They really, really want you to move this data to RDS Aurora. They go through painstaking detail to talk about using the AWS schema conversion tool to convert your schema over; they talk about the AWS database migration service to migrate the data over, and then they talk about performing post-migration activities such as running SQL queries for validating the object types, object count, things like that. I think that a lot of folks actually don’t know that the database migration service exists, and it’s something worth calling out as a really powerful tool.
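For a sense of what a DMS migration task looks like in code, here's a minimal sketch via boto3; every ARN and the schema name below are placeholders, and a real migration needs endpoints and a replication instance created first:

```python
# Sketch: a DMS task that copies existing data, then streams ongoing changes.
import boto3
import json

dms = boto3.client("dms")
dms.create_replication_task(
    ReplicationTaskIdentifier="oracle-to-aurora",
    SourceEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:INSTANCE",
    MigrationType="full-load-and-cdc",  # initial copy plus change data capture
    TableMappings=json.dumps({
        "rules": [{
            "rule-type": "selection", "rule-id": "1", "rule-name": "all",
            "object-locator": {"schema-name": "APPDATA", "table-name": "%"},
            "rule-action": "include",
        }]
    }),
)
```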


Pete: Yeah, the Amazon DMS service is honestly, I think, a super-underrated service that people just don’t know about. It has the ability to replicate data not only from on-premises databases to Amazon databases, but also between databases already running on Amazon. You could replicate from a database running on EC2 into Aurora. You could replicate data into S3, bringing things into sync that way, and then maybe use it for other purposes. It can replicate data from DocumentDB into other sources.

So, they’re clearly doing a big investment in there. And to Jesse’s point, yeah, Amazon really wants this data. So, talk to your account manager as you’re testing out some of these services. Do a small proof of concept, maybe, to see how well it works, if you can understand the queries, or you can point your application over at an Aurora database with some of this data migrated in; that’s a great way to understand how well this could work for your organization. But as Jesse mentioned, they do want that data in Aurora.

So, if it turns out that you’re looking at your—you know, migrate some data in there, and it’s starting to work, and you’re kind of getting a feel for the engineering effort to migrate there, stop. Talk to your account manager before you spend any more money on Aurora because it’s very likely that they can put together a program—if a program doesn’t already exist—to incentivize you to move that data over; they can give you subject matter expertise; they can provide you credits to help you migrate that data over. Don’t feel like you have to do this on your own. You have an account team; you should definitely reach out to them, and they will provide you a lot of help to get that data in there. They’ve done it for many of their other clients, and they’re happy to do it for you because they know that, long term, when you move that data to Aurora, it’s going to be very sticky in Aurora.


You’re probably not going to move off of there. It’s a long game for them; that’s how they play it. So, check out those services; that could be a really great way to help you get rid of your Oracle addiction.


Jesse: Yeah, and if you’re able to, as we talked about earlier, if you’re able to identify workloads that don’t need to run in a relational database, or don’t need to run in, maybe, a database at all, for that matter, stick that data in S3. Call it a day. Put them on lifecycle management policies or different storage tiers, and use Athena for ad hoc queries, or maybe Redshift if you’re doing more data warehouse-style tasks. But if that data doesn’t need to live in a relational database, there are many cheaper options for that data.
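A sketch of what those lifecycle rules might look like via boto3; the bucket name, prefix, and day thresholds are all illustrative:

```python
# Sketch: lifecycle rules that tier colder data down and then expire it,
# for data that doesn't need to live in a relational database.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="example-archive-bucket",  # hypothetical bucket
    LifecycleConfiguration={"Rules": [{
        "ID": "tier-then-expire",
        "Status": "Enabled",
        "Filter": {"Prefix": "raw/"},
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},
            {"Days": 90, "StorageClass": "GLACIER"},
        ],
        "Expiration": {"Days": 365},
    }]},
)
```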


Pete: Exactly. But one last point I will make is don’t shove it into MongoDB just because you want to have schema-less, or—

Jesse: Please.

Pete: —think about what you’re going to use it for, think about what the data access patterns are, because there is a right place for your data. Don’t just jump into NoSQL just ‘cause, because you’ll probably end up with a bigger problem in the long run.

Corey: If your mean time to WTF for a security alert is more than a minute, it's time to look at Lacework. Lacework will help you get your security act together for everything from compliance service configurations to container app relationships, all without the need for PhDs in AWS to write the rules. If you're building a secure business on AWS with compliance requirements, you don't really have time to choose between antivirus or firewall companies to help you secure your stack. That's why Lacework is built from the ground up for the Cloud: low effort, high visibility and detection. To learn more, visit lacework.com.

Pete: So Jesse, I’m looking at our list of questions. And it turns out, we have another question.

Jesse: Ohh.


Pete: Two questions came in.


Jesse: You like me, you really like me!


Pete: It’s so great. Again, you can also send us a question, lastweekinaws.com/QA. You can go there, drop in a question and feel free to put your name. Or not; you can be anonymous, it’s totally fine. We’ll happily answer your question either way. So Jesse, who is our next question from? What is this one about?


Jesse: This one’s from [Joseph 00:13:19]. They write in, “Hey, folks. Love the show. Longtime listener, first-time caller.” Thank you. “I would love to know how people manage their costs in AWS Batch. Jobs themselves can’t be tagged for cost allocation, which makes things a bit complicated.” Lord Almighty, yes, it does. “How best should I see if the jobs are right-sized? Are they over-provisioned in terms of memory or compute? What’s the best way to see if EC2 is my better choice, versus Fargate, versus other options? How can I tell if the batch-managed cluster itself is under-utilized?”


Pete: Oof. This is a loaded question with a lot of variables.


Jesse: Yeah. And so we’re going to break it down because there’s definitely a couple questions here. But I want to start off with what AWS Batch is, just really quick, to make sure everybody’s on the same page here. AWS Batch, effectively, is a managed service in AWS that schedules and runs your batch computing jobs on top of AWS compute resources. It does a lot of the heavy lifting configuration for you so you can just focus on analyzing the results of those jobs.
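For anyone who hasn't touched Batch, a minimal sketch of submitting a job with boto3; the queue and job definition names are stand-ins:

```python
# Sketch: submitting work to an AWS Batch queue.
import boto3

batch = boto3.client("batch")
resp = batch.submit_job(
    jobName="nightly-report",
    jobQueue="analytics-queue",
    jobDefinition="report-builder:3",  # name:revision
    containerOverrides={
        # right-sizing these values is where the savings live
        "resourceRequirements": [
            {"type": "VCPU", "value": "2"},
            {"type": "MEMORY", "value": "4096"},  # MiB
        ],
    },
)
print(resp["jobId"])
```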

Pete: Yeah, exactly. And Batch supports a really wide variety of tooling that can operate this, and that’s why it’s hard for us to say, specifically, how you might optimize it. But I think some of the optimizations actually mirror a lot of the optimizations we’ve done with EMR clusters and things of that nature, where you’re running these distributed jobs. And you want to make sure that if you’re running straight off of EC2 instances, they are essentially maxed out. If the CPU is anything less than 100% for an on-demand instance, then there’s waste, or there’s opportunity for improvement. And so making sure that your jobs are sized appropriately, balancing out memory and CPU so that, effectively, you’re using all of the memory and all of the CPU, that’s a real basic first step.


But honestly, a lot of folks kind of miss out on that. They just kind of run a job and go off and do their own thing. They never really go back and look at those graphs. You can go to CloudWatch, they’re all going to be there for you.
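Going back and looking at those graphs can also be done programmatically. Here's a hedged sketch that pulls average CPU for a single instance; the instance ID is a placeholder:

```python
# Sketch: hourly average CPU for one instance over the last day,
# straight from CloudWatch.
import boto3
from datetime import datetime, timedelta

cw = boto3.client("cloudwatch")
stats = cw.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder
    StartTime=datetime.utcnow() - timedelta(days=1),
    EndTime=datetime.utcnow(),
    Period=3600,
    Statistics=["Average"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], f"{point['Average']:.0f}%")
```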


Jesse: Yeah. And to this point, there’s always an opportunity to make these workloads more ephemeral. If you have the opportunity to make it more ephemeral, please, please, please, please, absolutely do so. Unless your batch job needs to run 24/7. We’ve seen that in a few cases where they have, essentially, clusters that are running 24/7, but they’re not actually utilized regularly; the workloads are only scheduled for a short amount of time.


So, if you don’t need those batch jobs running 24/7, please, by all means, move to more ephemeral resources, like Fargate. Fargate on Spot, Spot Instances in general, or even Lambda, which AWS Batch now supports as well.


Pete: Yeah, it has some step function support, which is pretty interesting. Yeah, this is a great opportunity to aggressively—aggressively—leverage Spot, if you’re not currently today. The reality is: check out Fargate on Spot if you don’t need, like, a custom operating system or a custom EBS volume size. If you do, then EC2 on Spot is probably the best option that you really have. But you really do not want to be running anything on on-demand instances. Even on-demand instances with a really good savings plan, you’re still leaving money on the table because Spot Instances are going to be a lot cheaper than even the best savings plan that’s out there.


Jesse: And I think that’s a good point, too, Pete, which is if you do need to run these workloads on-demand, 24/7, think about if you can get away with using Spot Instances. If you can’t get away with using Spot Instances, at least purchase a savings plan if you don’t do anything else. If you take nothing else away from this, at least make sure that you have some kind of savings plan in place for these resources so that you’re not paying on-demand costs 24/7. But in most cases, you can likely make them more ephemeral, which is going to save you a lot more money in the long run.

Pete: Yeah, exactly. That’s the name of the game. I mean, when we talk to folks on Amazon, the more ephemeral you can make your application—the more you can have it handle interruption—the less expensive it will be to operate. And that goes everywhere, from Spot Instances and how they’re priced on down. If you just get a normal Spot Instance, it will have a really aggressive discount on it if you need zero time in advance before interruption.


So, if that instance can just go away at any second, then you’ll get the best discount on that Spot Instance. But if your app needs a little time, or runs for a defined period of time—let’s say your app runs for one hour—you can get a defined duration Spot of one hour. You’ll still get a great discount and you’ll only pay for however long you use it, but you will get that resource for one whole hour, and then you’ll lose it. If that’s still too aggressive, there are configurable options up to six hours. Again, less discount, but more stability in that resource. So, that’s the trade-off you make when you move over to Spot Instances.


Jesse: So, I also want to make sure that we get to the second part of this question, which is about attributing cost to your AWS Batch workloads. According to the AWS Batch documentation, you can tag AWS Batch compute environments, jobs, job definitions, and job queues, but you can’t propagate those tags to the underlying resources that actually run those jobs. Which to me, kind of just defeats the point.

Pete: Yeah. [sigh]. Hashtag AWS wishlist here. You know, again, continuing to expand out tagging support for things that don’t support it. I know we’ve seen kind of weird inconsistencies, even just, like, tagging ECS jobs and where you have to tag them for the tags to apply.

So, I know it’s a hard problem, but obviously, it’s something that should be continually worked on because, yeah, if you’re trying to attribute these costs, you’re left with the only option of running them in separate Amazon accounts, which solves this problem but, depending on your organization, could increase the management overhead. But that is the ultimate way. I mean, that is the one way to ensure 100% of costs are encapsulated to a service: have them run in a dedicated account. The downside being that if you have a series of different jobs running across different, maybe, business units, then obviously that’s going to break down super quick.

Jesse: Yeah, and it’s also worth calling out that if there are any batch jobs that need to send data to different places—maybe the batch job belongs to product A, but it needs to send data to product B—there’s going to be some amount of data transfer, either cross-region or across accounts, in order to share that data, depending on how your organization and your products are set up. So, keep in mind that there are potentially some minor charges that may appear with this, but ultimately, if you’re talking about the best ways to really attribute costs for your AWS Batch workloads, linked accounts is the way to go.

Pete: Yeah. If you need attribution down to the penny—some of our clients absolutely do. For invoicing purposes, they need attribution for business unit down to the penny. And if you’re an organization that needs that, then the only way to get that, effectively, is segmented accounts. So, keep that in mind.

Again, that’s until Amazon comes out with the ability to do a little bit more flexible tagging. But also, feel free to yell at your account manager—I mean, ask them nicely. They are people, too. But, you know, let them know that you want this. Amazon builds what the customers want, and if you don’t tell them that you want it, they’re not going to prioritize it. I’m not saying that if you tell them, you’re going to get it in a couple of months, but you’re never going to get it if you don’t say anything. So, definitely let people know when there’s something that doesn’t work the way you expect it to.


Jesse: Absolutely.

Pete: Awesome. Wow. Two questions. I feel it’s like Christmas. Except—

Jesse: [laugh].


Pete: —it’s Christmas in almost springtime. It’s great. Well, again, you, too, can join us by being a Friend of the Pod, which Jesse really loves for some reason. [laugh].


Jesse: Yeah. Don’t know why, but it’s going to be stuck in my brain.


Pete: Exactly. You too can be a Friend of the Pod by going to lastweekinaws.com/QA and you can send us a question. We would love to spend some time in a future episode, answering them for you.


If you’ve enjoyed this podcast, please go to lastweekinaws.com/review. Give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review and give it a five-star rating on your podcast platform of choice and tell us why you want to be a Friend of the Pod. Thank you.


Announcer: This has been a HumblePod production. Stay humble.

Fri, 16 Apr 2021 03:00:00 -0700
Machine Learning is a Marvelously Executed Scam

Want to give your ears a break and read this as an article? You’re looking for this link. https://www.lastweekinaws.com/blog/Machine-Learning-is-a-Marvelously-Executed-Scam



Never miss an episode



Help the show



What's Corey up to?

Wed, 14 Apr 2021 03:00:00 -0700
Suspiciously Warm Pools
AWS Morning Brief for the week of April 12, 2021 with Corey Quinn.
Mon, 12 Apr 2021 03:00:00 -0700
Predict Your Future (and Make Your CFO Happy)

Links:


Transcript

Corey: This episode is sponsored in part by LaunchDarkly. Take a look at what it takes to get your code into production. I’m going to just guess that it’s awful because it’s always awful. No one loves their deployment process. What if launching new features didn’t require you to do a full-on code and possibly infrastructure deploy? What if you could test on a small subset of users and then roll it back immediately if results aren’t what you expect? LaunchDarkly does exactly this. To learn more, visit launchdarkly.com and tell them Corey sent you, and watch for the wince.



Pete: Hello, and welcome to the AWS Morning Brief: Fridays From the Field. I am Pete Cheslock.



Jesse: I’m Jesse DeRose.



Pete: We’re back again. And we’re here. We made it, Jesse.



Jesse: I was worried. This was a journey. Thank you, everybody, for coming on this journey with us.



Pete: It was quite an experience going through the Unconventional Guide to AWS Cost Savings. We’ve made it. I just can’t believe we’re here.



Jesse: Yeah.



Pete: So, what are we talking about today for the culmination of our magnum opus of cost savings optimizations?



Jesse: This is a fun one. And I know I keep saying this about every one of these, but I have to admit that this topic today probably is my absolute favorite. This one I get really nerdy over. Today, we’re talking about how to predict your future and make your CFO happy. No—spoiler alert—there are not any crystal balls involved in this one. There’s no stock market conversations.

This is talking about how you can use all of the different things that we’ve talked about throughout the course of this Unconventional Guide to really bring it all together into a couple ideas that will help you better understand your cloud costs, and really better understand your business, I think.



Pete: Yeah. All of the things we talked about really lead up to this one: the clients of ours that are the most mature, who are incredibly optimized in their Amazon usage, are the ones who have adopted a majority of these specific items. They all lead to this last one, that ability to predict your future usage based on something that’s happening internally, or if a salesperson comes to you and says, “Hey, we’re about to close this deal, but I need to discount our service.” People are going to start wanting to know, well, what is the cheapest that you could sell your service for and still have a positive gross margin?




Jesse: Yeah. So, if you’ve done a lot of the things that we’ve talked about in the last couple episodes—I apologize, I know homework’s not the best for a podcast—but if you’ve had the opportunity to work on some of those things, you should have a ton of valuable insights into your spend. We’re talking about tagging, and showback models in particular, maybe even a chargeback model. You can ultimately use all of this data to better understand what your forecasted spend is going to look like with a new potential customer coming onto the platform. Or, if you get into the topic that we’re going to talk about today, which is mostly unit economics, you can really understand how much can I discount my service and still make a profit, like Pete mentioned.



Pete: Yeah, I mean, imagine there’s a global pandemic that happens, and it causes your usage to spike by 500% within the course of a month. How did your spend change? Do you know where it changed? And did it change in ways that you were expecting it to? Like, my databases grew by a lot, and this other thing didn’t grow by very much.



Like, that would be expected. But there’s also a question that we actually like to ask a lot of our clients: if your sales just doubled overnight, how would your spend change? Where are the places that are most expensive to operate your service? And again, this is kind of generic. I’ve worked in a lot of SaaS services, so I always think of sales, but whether you’re using the cloud for a SaaS service that you provide and sell, like B2C or B2B things, or for something internal, you still have users.




They might be internal users. Well, what if your users doubled overnight? What if half the company was using your internal service and now the whole company is? How does that change your usage?




Jesse: And it’s also important to think about not just your AWS usage, but all of the other services that you use that support your overall business model: things like monitoring and observability tools, logging vendors, maybe third-party SIEM tools. All of these are affecting your overall total infrastructure cost and are all part of this conversation. So, it’s really important to start thinking about those architecture diagrams. Remember when we said, way, way back at the beginning of this conversation, to overlay costs on top of your architecture diagram? Understanding that, understanding what parts of your product or what parts of your architecture are the most expensive, will really help you understand what’s going to change.



Pete: Yeah, let’s say you’ve got a six-figure bill to Datadog or one of the big log management vendors out there. Inside of that bill, is that all just evenly spread across the whole business? What if the entire spend on your log vendor was driven by one service that some developer left debug logging enabled for? You’d want a way of understanding that that spend was concentrated in, maybe, a non-production aspect of your account. Because then again, that wouldn’t grow, right? That wouldn’t track with growth in your sales the same way as if all of your services were equally sending logs of a certain volume over.



So, all of those extra services, they all add up, and we see it more and more, as more of our clients start adopting more than just Amazon services: they might be adopting a Snowflake, they might be adopting third-party services running databases running in other services, or EMR type workloads that are not on EMR, and they’re running on Qubole or things like that. There’s just a lot of these services that more and more people are consuming from that fall outside of just the AWS invoice.



Jesse: And this also gets back to not just architecture diagrams, but also tagging and showback models, cost visibility, really understanding where your spend is going. And this is fantastic to understand where your spend is going, but finance is probably going to want something a little bit more than this. It’s not just about how much are we spending, or where are we spending it, and maybe it’s not even a finance question. Maybe this is a sales conversation, assuming that you’re a SaaS company. Maybe this is, as Pete mentioned before, “Hey, we want to understand where can we provide discounts? What services can we ultimately discount to negotiate getting new customers on the platform?”



Pete: So, Jesse, we hear a lot of these terms a lot, and I’d love a ‘explain like I am a five-year-old’ version of it, but we hear a chargeback. And we hear showback, and honestly, I’ve never worked at those massive companies where you might implement these things, but can you give us just a real quick—for all the listeners out there, when we say showback, what does that mean? And when we say chargeback, what does that mean?



Jesse: So, a showback model essentially takes all of your cloud costs, all of your total infrastructure spend, your AWS spend, all the third-party spend, and it shows every team, every product, every microservice, maybe, depending—or maybe even business unit, depending on how your organization is split up—it shows each one of those units, how much they are actually spending, how much they’re actually using these different cloud vendors. So, this is where tagging comes in super handy because if you’ve tagged all of your taggable resources, and properly attributed all of your cloud costs with tagging and linked accounts, you have a very clear idea of who’s spending what. You know very clearly, maybe 70 to 80% of your total infrastructure spend is related to one particular product because all the cost is attributed to one particular product. And maybe that’s something you didn’t know before. Maybe now you know okay, maybe that product needs to be a little bit more expensive so that we can make sure that we are making money off of it, or profiting off of it, whereas other services can be discounted because they’re not as expensive.



Whereas in a chargeback model, you are ultimately not just showing each of these teams, hey, here’s how much you spent on AWS and Datadog usage and all these other vendors every month, you’re actually charging them for that usage. You’re actually pulling their cloud costs from their budget.
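A toy sketch of the difference: a showback roll-up just reports each team's attributed spend, while a chargeback model would deduct those same amounts from each team's budget. All numbers here are invented:

```python
# Sketch: a minimal showback roll-up from tagged cost rows.
from collections import defaultdict

# (team tag, monthly cost in USD) rows, e.g. exported from Cost Explorer
cost_rows = [("checkout", 41_200.0), ("search", 18_750.0), ("checkout", 9_300.0)]

showback = defaultdict(float)
for team, cost in cost_rows:
    showback[team] += cost

for team, cost in sorted(showback.items(), key=lambda kv: -kv[1]):
    print(f"{team}: ${cost:,.0f}")
    # a chargeback model would additionally subtract `cost` from the team's budget
```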



Pete: Yeah. They might actually have a budget of money. It’s all—if you want to really explain like I’m five, it’d be like, I give my child their—they get $1 for all of the tasks that they do throughout the week, I don’t actually give them the money because I usually have to subtract out their, like, Roblox spend of the week [laugh] and things like that. It’s all virtual, but at the end of the day, you know, we’re kind of virtually giving this business unit some money, and then, kind of, virtually charging them for their services within.



Jesse: Yeah. And this is mature. We don’t see a lot of companies doing this. This is hard because you have to take other steps first to get here. And so this is why we harp so much on cost attribution through tagging and through linked accounts.



This is why we harp so much on cost visibility and overlaying those cloud costs on your architecture diagrams to understand all of this data to lead to this point, which is understanding, where, how much is my primary product actually costing us? How much is my secondary product actually costing us? Or maybe how much is this business unit costing us in terms of total infrastructure spend?



Pete: Yeah, I mean, I can kind of share my history with this at previous companies is that, again, eventually someone in the financial department is going to say, “What was our cost for Amazon?” They specifically will want to know the production cost because that figures into a term called gross margin, which you often hear at SaaS businesses. Gross margin is basically you take all the revenue that came in and you subtract away what it took to support that revenue. And mostly, that is just the Amazon bill and these other vendors, perhaps, and you end up with a percentage. And hopefully, it’s a positive percentage.



It means you’re theoretically making money at a gross level, I mean, obviously, before you pay salaries and all those other items, but that’s kind of beside the point for now. That number, you’re probably going to get asked for. So, you wouldn’t want to give your straight Amazon bill, like, “Oh, well, we spent $100,000 last month,” because some of that spend was probably in research and development; it was probably in a development account or a QA account. You really just want your product spend. So, at a previous company, the first step we took was to break out our spend via production and development, just two criteria. Now, for us, because we started with just a handful of accounts—this was before a lot of accounts were more prevalent, before organizations—before it was easy to handle a lot of accounts—we had a Prod and a Dev. Super easy. Look at Prod, look at Dev. There’s the two bills.
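To make the gross margin arithmetic concrete, here's a sketch with purely illustrative numbers:

```python
# Sketch: gross margin = (revenue - cost of delivering it) / revenue.
revenue = 500_000.0          # monthly revenue, made up
prod_cloud_cost = 100_000.0  # production AWS bill only, not Dev/QA
other_cogs = 40_000.0        # log vendors, other third-party infra, etc.

gross_margin = (revenue - prod_cloud_cost - other_cogs) / revenue
print(f"Gross margin: {gross_margin:.0%}")  # 72% with these numbers
```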



But then as time went on, we needed to get more granular. We were running some development workloads, testing out new databases at scale in kind of a hidden dark deployed mode, in production. Well, we want to subtract that spend from there. And that requires tagging. I mean, that’s why we really harped on tagging for a couple of episodes because tagging is the only way you’re going to be able to do that.




Now, we see more often a lot of our clients do maybe an account per product, or an account per business unit. Those are, again, really effective ways to corral your spend, to make it really easy to break out and add up. It’s really just trying to break it down to the most reasonable spend unit possible that you can then play around with and adjust. Mostly to go back to your CFO when they ask you, “Hey, I need to know this specific answer.” You’ve got it, hopefully, somewhat available.



Jesse: Okay, so this is where we’re going to start talking about unit economics. And hopefully, your eyes will not glaze over. I want to make sure that—this is important, this is really actually beneficial. It’s not just a specific economic thing that you learned back in Econ 101. This is actually going to be useful and beneficial for your organization.



So, unit economics describes your product in terms of revenues and costs in relation to a unit KPI—that’s where the ‘unit’ term comes from in ‘unit economics’—that tracks closely with customer demand. So, that’s a really gross definition, I know, and I apologize.



Pete: You know, and we can even extend that a little bit further and give some good examples. Like, maybe if you are a website that provides eLearning services, your unit might be the number of daily active users, or thousands of daily active users. That could be a unit that you’re selling. I actually worked at a SaaS company where we sold a piece of software that would run per server, and we broke our unit down to the servers—the things that we sold—down to that level.




Jesse: Yeah, if you are in the airline industry, for example, your unit would probably be every passenger. How many passengers are you able to sell tickets to on every plane? What do those costs look like?



Corey: If your mean time to WTF for a security alert is more than a minute, it's time to look at Lacework. Lacework will help you get your security act together for everything from compliance service configurations to container app relationships, all without the need for PhDs in AWS to write the rules. If you're building a secure business on AWS with compliance requirements, you don't really have time to choose between antivirus or firewall companies to help you secure your stack. That's why Lacework is built from the ground up for the Cloud: low effort, high visibility and detection. To learn more, visit lacework.com.



Pete: And you don’t need just, like, one unit. Maybe you have one unit for your whole platform like the whole gross production spend breaks down into one specific unit, you could do that. But you could also have units at a per-service level because maybe it’s like you’re processing a lot of documents. I worked for an email archiving company, forever ago, and our unit was the amount of emails that were indexed and archived so we could figure out, we might have one customer who just didn’t generate a lot of emails, but they had tons of users. Well, one of our units was the volume of emails that we were indexing and archiving for that customer, whereas on the flip side, if maybe our spend was driven more by user count, and not document count, maybe that’s what we want our unit to be, is per user.



Jesse: Yeah. It’s really important to call out that you might have a single easy-to-define unit; you might have a more complex relationship that’s weighted with a couple different factors of different components of the architecture. But ultimately, your unit KPI and how you break out your costs to support your customers will be unique to your overall business.




Pete: Exactly. And this is where you’re only going to find this answer out with a lot of conversations, internally. It could come to you pretty easily, you know, just based on how your business is. But I think for a lot of folks using Amazon, especially if you’re just in a specific business unit inside of a broader business, it could be a little bit more challenging to figure out. But what you’re really trying to do is just figure out, when X changes, our spend changes, and we spend more or we spend less. Try to solve for X. That’s really what you’re trying to do.



Jesse: Okay. So, now we’ve covered the unit KPI part of this conversation. Awesome. So, we’re done, Pete, right? We just take our AWS bill and then—



Pete: Yeah.



Jesse: —divide it by the unit and we’re done.



Pete: So, easy. I’ve got my unit. I’ve got my bill. I got an iPhone that can do a calculator. Good to go.



Jesse: [laugh]. Good. We’re done. Well, wait. What about if I have multiple AWS accounts? Wait, what if I have multiple different products?



Pete: Yeah, that’s… I mean, I’ve got a calculator. I mean, I might be here all day, but…



Jesse: [laugh]. I’ve got a whiteboard. We’ve got some time.



Pete: Yeah, we got time. That’s a great point, though. Again, what if you do have things that are just spread all over the place? What if you’ve got two different products, two different services inside the same account? Because of course, you would. That’s a super normal thing. I’m not even saying that sarcastically. That’s a super normal thing.



Jesse: Absolutely.



Pete: Well, how do you handle this? How do you handle shared services?



Jesse: Yeah. I mean I—



Pete: We could go on for too long on that one, but these are questions you really want to start asking.



Jesse: Yeah. And remember that you’re potentially going to have different unit KPIs for different products, for different business units. That’s fine. That’s expected. But make sure that you are measuring appropriately for each of those.



The incoming revenues and costs for each of these aren’t going to change. That’s coming from your tagged usage and your linked accounts, but maybe the unit that you’re dividing that spend by is going to change, and that’s fine. This is where a spreadsheet comes in super, super handy. I love my Excel spreadsheets for this. Very, very easy to just bring in all of the bill data across different accounts, and really clearly attribute this spend is for this service, or this spend is for this product, or this spend is for this business unit, and then divide that by the unit in question to get your actual unit KPI, to get your unit economic metric.
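The spreadsheet math itself is simple; here's a sketch with invented numbers showing per-product unit costs, where each product gets its own unit KPI:

```python
# Sketch: divide each product's attributed spend by its own unit KPI.
products = {
    # product: (monthly attributed cost in USD, unit KPI value)
    "elearning": (84_000.0, 120_000),    # units = daily active users
    "archiving": (51_000.0, 9_500_000),  # units = emails indexed
}

for name, (cost, units) in products.items():
    print(f"{name}: ${cost / units:.4f} per unit")
```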



Pete: Yeah, and this is where the superpower comes in. Once you have this number, now you can better understand and make product-level decisions. Again, whether you’re a SaaS product with a product you’re selling to external customers or building an internal tool, your product is the thing that the internal users consume. Your product decisions can now be driven by this. I mean, I have recollections of conversations with product teams, where they would talk about certain services internally, how they wanted to expand and do all the stuff with it.



And I said, “Well, right now, that one service represents one-third of our total spend, right? One-third of our gross margin cost. But we looked at the users, and it’s only being used by one percent of them.” When you have these big numbers and you’re saying, “Wow, the company spends a third of their money on something one percent of people use,” then maybe that’s not the place we want to be investing product decisions into. Maybe it is, but you don’t know enough to have that conversation unless you have this data.



Jesse: Absolutely. I think there’s one other small caveat that we haven’t touched on that I do want to call out, and this comes back to your conversation about tagging. We have noticed that a lot of teams want to tag to a certain extent, and then start building their showback models immediately. It’s great that you’ve got investment, you’ve got energy, you really want to get to that showback model, that chargeback model, that unit economic model space. But if your cloud usage is not thoroughly and accurately tagged, your resulting data is not going to be accurate either. So, we think about this in terms of a cost margin or a cost error.



So, for example, if your production spend or your production usage is only 60% tagged, that means you’ve got 40% error in that data that’s coming in; your cloud spend for production has 40% error margin, which is huge.



Pete: Yeah, exactly. Track your untagged spend, as well as your tagged spend. I mean, make sure you have a story for the things that are not tagged. That includes things like data transfer and things that maybe are not as taggable within AWS. That’s an important aspect of this that you’ll want to make sure you’re at least not forgetting about.



Even if you can’t tag it, you don’t have a solution for it, make sure it’s in the back of your head that this is maybe not as accurate of a forecast, because we’re just taking data transfer and dividing it by product versus actually looking at which product uses the most data transfer.



Jesse: Yeah, and this is a tough concept, so don’t feel bad if you listen to this episode again; don’t feel bad if you go download the Unconventional Guide from the Duckbill Group website—we’ll have the link in the [show notes 00:20:12]. This one is a tough concept because it brings in a lot of other moving parts to ultimately get at this one unifying, really, really important idea. This is one that we see a lot of clients and potential clients struggle with. So, if you’re taking some time to understand this concept, you’re not alone.



Pete: Exactly. This is the goal of all of the previous work, and this is something that you would measure in just a multi-year commitment in most businesses. And the larger the business is, the longer that work is going to take because it’s hard, there are a lot of moving pieces, and so many things need to be done in advance of all of this. And again, realize you’re not doing this work in a vacuum. There’s things that are moving and shifting as it’s all happening. So, don’t beat yourself up if you’re looking at this and thinking to yourself, “This is just a huge task. I’m never going to get this done.” It’s just not something that’s going to happen overnight.




All right, well, hey, if you’ve enjoyed this podcast, if you’ve enjoyed this series, please go to lastweekinaws.com/review and give it a five-star rating on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review, give it a five-star rating, but then tell us what’s the next series you want to have if you didn’t like this one.



Also, don’t forget, you can give us your feedback and any questions that we’ll be continuing to answer in future episodes at lastweekinaws.com/QA. You don’t need to put your name; it can be totally anonymous. Give us your question. We’d love to dive into some of those topics.



And finally, you can download our Unconventional Guide, the whole PDF of everything we’ve talked about at the Duckbill Group website. We’ll include that link in our [show notes 00:21:51] and you can head over there. Thanks again.



Announcer: This has been a HumblePod production. Stay humble.

Fri, 09 Apr 2021 03:00:00 -0700
Nobody Cares About the Operating System Anymore

Want to give your ears a break and read this as an article? You’re looking for this link.



Never miss an episode



Help the show



What's Corey up to?

Wed, 07 Apr 2021 03:00:00 -0700
AWS Space Accelerator vs. AWS Global Accelerator
AWS Morning Brief for the week of April 5, 2021 with Corey Quinn.
Mon, 05 Apr 2021 03:00:00 -0700
Win Friends and Influence DevOps: Continual Tagging Improvement

Links:


Transcript

Corey: This episode is sponsored in part by LaunchDarkly. Take a look at what it takes to get your code into production. I’m going to just guess that it’s awful because it’s always awful. No one loves their deployment process. What if launching new features didn’t require you to do a full-on code and possibly infrastructure deploy? What if you could test on a small subset of users and then roll it back immediately if results aren’t what you expect? LaunchDarkly does exactly this. To learn more, visit launchdarkly.com and tell them Corey sent you, and watch for the wince.



Pete: Hello, and welcome to the AWS Morning Brief: Fridays From the Field. I’m Pete Cheslock.

Jesse: I’m Jesse DeRose. [laugh].


Pete: Hashtag #FFF. Not my grades in high school; that is Fridays From the Field.


Jesse: We will make it a thing. It’s going to happen.


Pete: It’s going to happen. We’re going to do our best to use the hashtag triple-F as much as possible. So, if you have any questions for us, just again, reminder, you can go to lastweekinaws.com/QA as we talk more about our Unconventional Guide to AWS Cost Management. Please give us your feedback, ask us some questions, we’ll answer those in a future episode. Today, we’re expanding on tagging. Because it’s so thrilling to talk about tagging some more, Jesse.

Jesse: We know that you have struggled to fall asleep at night listening to our podcasts. So, we wanted to do a very special episode just for you, to talk more about tagging. Let’s move into our NPR voices. [silky-smooth voice] Hello, and thank you for listening.

Pete: [buttery-smooth voice] Sponsorship of this—no, I’m just kidding. We’re not—we leave that work to, Corey.


Jesse: [laugh].


Pete: So, today is really about how to win friends and influence DevOps, and it’s all about continual tagging improvement.

Jesse: We talked about the importance of tagging, and one of the things that’s really important to tagging is identifying a tagging strategy, and then building and developing that tagging strategy over time. Your tagging strategy is going to change over time; that is the nature of the beast. Your organization is going to change over time, therefore your organization’s needs are going to change over time, and the tagging strategy and the tagging needs are going to change over time, as well.

Pete: Exactly. You’re going to build new products; you’re going to grow, hopefully; you’re going to add additional Amazon accounts; you can make acquisitions; you could get sold to another business. There’s just so many things that are going to happen, they’re going to change. It’s just inevitable. So, how do you continue this process of tagging, and this is, I think, a really important discussion because when you start that process, you take that first step and you start investing in tagging, the best way to get those—you know, that compound interest on all of the return value that you’re putting into tagging, is by making it a long term, continual process. And I’m not talking about, like, “Well, you know, we do a little thing every month, and it’ll be good by, I don’t know, maybe a month or two, next quarter. And then we’ll be done.”

Jesse: [laugh].


Pete: And that doesn’t work. The best companies that we’ve seen that have really knocked this out of the park have turned this into just a multi-year endeavor. It is going to take you a long time to reach just, like, the pinnacle of tagging, having that ability to allocate just down to the penny of your Amazon spend is going to take a long time. So, manage those expectations appropriately that this is not an overnight fix.

Jesse: So ultimately, at this point, you’ve tagged all of your resources; you’ve built this policy. The next thing to really think about is, why? Because in a lot of cases, a lot of engineers are going to ask you this very question. Why should we tag this information? Why should we tag these resources?

And you’re going to need an answer that’s more than just, “Well, finance wants this information,” or, “Product wants this information,” or, “The engineering leadership team wants this information.” What you’re getting at with tagging is cost attribution. So, at a really high level, for those who aren’t familiar, cost attribution is the process of identifying, aggregating, and assigning your AWS spend to your organization’s teams, your business units, your products, however you want to slice-and-dice that data, whatever different tags you might be leveraging within your tagging policy. So, it’s really about where is your AWS spend going, along these different lines of the different things that finance cares about, that engineering cares about, that product cares about, that IT or security cares about. So, it’s not just about tagging your resources so that everything’s tagged, but it’s about leveraging that information to understand, where are your costs going?


Pete: I think that also gives companies a great KPI—Key Performance Indicator, for the non-business folks. It’s a good metric. A good way to track your success with tagging is to answer this question: what percentage of spend is tagged? Not the number of resources, because there are some taggable resources that simply don’t have a cost. So, tracking by percentage of resources tagged is, for the most part, not useful.


Jesse: Yeah.


Pete: But tracking what percentage of your spend is tagged—and specifically tagged with tags that are enabled as cost allocation tags, which is something you need to make sure you set up—by tracking that spend, that KPI, that’s how you can start to understand how good a job you’re doing at this. Now, again, we’re obviously focused on tags as a cost attribution strategy. But the reality is that’s the main use of them on Amazon, specifically. The main use of tags, again, that we see is so people can understand where the money’s going.
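If you want to put a number on that KPI, here’s a minimal boto3 sketch that pulls one month of spend from the Cost Explorer API and computes what percentage of it is tagged. It assumes a hypothetical “team” tag that has already been activated for cost allocation:

```python
# Minimal sketch: percentage of spend carrying a hypothetical "team"
# cost allocation tag, via the Cost Explorer API. Assumes the tag is
# already activated for cost allocation.
import boto3

ce = boto3.client("ce")

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2021-03-01", "End": "2021-04-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "team"}],  # hypothetical tag key
)

tagged = untagged = 0.0
for group in resp["ResultsByTime"][0]["Groups"]:
    amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
    # Cost Explorer reports spend with no "team" tag under the bare "team$" key.
    if group["Keys"][0] == "team$":
        untagged += amount
    else:
        tagged += amount

total = tagged + untagged
if total:
    print(f"{100 * tagged / total:.1f}% of spend is tagged")
```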


Jesse: Yeah. AWS even calls them out as user-defined cost allocation tags. For example, if you want to log into Cost Explorer and see where your spend is going among different products, among different teams, among different business units, you need to make sure that the tags you’re leveraging are enabled as cost allocation tags in the billing console so Cost Explorer can use them. So, that’s a really important footnote to call out.


Pete: Yes, and to that point: if you do enable your cost allocation tags (there are maybe some default ones that Amazon will enable for you, but you’ll have to enable any of your own custom ones), those take effect going forward; they’re not retroactive. So, if you want to understand what a certain tag is costing you, make sure to go and enable it as soon as you possibly can, because you’re not going to be able to look back at Cost Explorer and see what the historical spend was.
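For reference, newer versions of the Cost Explorer API also let you activate a tag programmatically rather than clicking through the billing console. A minimal sketch, assuming a hypothetical “team” tag key and that you’re running in the payer account:

```python
# Minimal sketch: activate a tag key as a cost allocation tag via the
# Cost Explorer API (requires a recent boto3 and the payer account).
import boto3

ce = boto3.client("ce")

ce.update_cost_allocation_tags_status(
    CostAllocationTagsStatus=[
        {"TagKey": "team", "Status": "Active"},  # hypothetical tag key
    ]
)
# Spend is only broken out by this tag from activation onward; it is not
# retroactive, so activate it as early as you possibly can.
```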


Jesse: And this is also a reason why we mention that it’s never, ever too early to start the tagging conversation and tagging your resources because the sooner you start tagging your resources, the more historical data you will have for that particular tag.


Pete: Yeah, and that ability to have historical data will help you forecast, which someone’s going to ask you for in the future if they haven’t already: to plan for future spend. And historical spend is a great way of understanding that. But again, our focus is on the cost side of things. There’s a lot of other great usage for tags in Amazon: security, access control, things like that are really useful. I’ve seen companies use tags as a way of marking hosts that were in compliance: you can tag an EC2 host as in compliance when it’s scanned, and then security teams can use those tags.

Super bonus if you can align those two needs, cost attribution plus your security needs, because that’s a really great way of incentivizing those engineers to make sure things are tagged really well. Another really interesting open-source product came out of a previous company; I swear, this is not a joke. It’s called Trash Taxi.


Jesse: Oh, my God. [laugh].


Pete: It has a fantastic logo. You can go to trash dot taxi and check out the very awesome logo there. A former colleague of mine built this. It’s actually a phenomenal use of tagging as a way to identify running assets that were, maybe, manually logged into.

So, of course, like, “You should never log into your EC2.” That’s the thing that people say. But everyone logs into their EC2 at some point. You got to look at something, you’re debugging, like, you just need to get on the host if you’re running something on EC2, and that’s fine. And at a previous company, that was an okay thing.


But what if, after someone logged in, we could mark that host with a tag and say that this is essentially a tainted host? Because if someone logged in, then theoretically they could have made a change that falls outside of a normal configuration management run. It’s now different than the others. So, what if we could mark it and then, later on, go and just terminate it and let the auto-scaling group replace it? That is, again, a really interesting use of tagging.
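A rough sketch of that “taint on login” idea (the concept, not Trash Taxi’s actual code; the tag schema is hypothetical):

```python
# Rough sketch of the taint-on-login concept (not Trash Taxi's code):
# when someone logs into an EC2 host, tag it so it can later be
# terminated and replaced by its auto-scaling group.
import boto3

ec2 = boto3.client("ec2")

def mark_tainted(instance_id: str, user: str) -> None:
    """Tag a host as tainted after an interactive login."""
    ec2.create_tags(
        Resources=[instance_id],
        Tags=[
            {"Key": "tainted", "Value": "true"},   # hypothetical tag schema
            {"Key": "tainted-by", "Value": user},
        ],
    )

# Called from, say, a login hook on the host itself:
# mark_tainted("i-0123456789abcdef0", "pete")
```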

Other uses that we’ve seen have to do with service discovery. I know in the earlier days, tagging APIs were a little dodgy; I’ve had outages due to the tagging API going down, which sounds like a crazy thing. But tags are another great way of driving some of your service discovery needs. And again, a great way to align with your needs of cost attribution.


Jesse: Yeah, and I think that there are a lot of different ways that you can make sure that these tags are applied, or use these tags for your policy work, for service discovery, all these things that Pete just talked about. There are tools like Cloud Custodian, for example, that I’ve used in the past. I used to work for Capital One, and we made extensive use of Cloud Custodian. As soon as we deployed any resources in any AWS account that didn’t quite fit the framework—EC2 instance types we weren’t allowed to use, resources that didn’t live in the correct availability zones or regions we were allowed to deploy in, or resources that didn’t have the right tags associated with them (how meta can you get?), maybe a missing owner tag, or not all the tags our policy required—Cloud Custodian would automatically tag that resource as non-compliant, and potentially send us an email or some kind of message so that we knew we had resources associated with our team that were non-compliant.

And then after a certain amount of time, if we took no action, those resources would be shut down automatically. Which I don’t necessarily recommend for production—we didn’t actually do that in production, we just stopped the instances, but you ultimately had this opportunity to really clearly enforce your tagging policies.
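Cloud Custodian itself is driven by YAML policies rather than hand-written code, but the pattern Jesse describes looks roughly like this as a boto3 sketch. The required tag keys here are hypothetical:

```python
# Rough boto3 sketch of the enforcement pattern (Cloud Custodian does
# this declaratively): find EC2 instances missing required tags and
# mark them non-compliant.
import boto3

ec2 = boto3.client("ec2")
REQUIRED_TAGS = {"owner", "team", "service"}  # hypothetical tag policy

paginator = ec2.get_paginator("describe_instances")
for page in paginator.paginate():
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            present = {t["Key"] for t in instance.get("Tags", [])}
            if REQUIRED_TAGS - present:
                ec2.create_tags(
                    Resources=[instance["InstanceId"]],
                    Tags=[{"Key": "compliance", "Value": "non-compliant"}],
                )
                # ...then notify the owning team, and only stop the
                # instance after a grace period, as Jesse describes.
```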

Corey: This episode is brought to you in part by our friends at FireHydrant where they want to help you master the mayhem. What does that mean? Well, they’re an incident management platform founded by SREs who couldn’t find the tools they wanted, so they built one. Sounds easy enough. No one’s ever tried that before. Except they’re good at it. Their platform allows teams to create consistency for the entire incident response lifecycle so that your team can focus on fighting fires faster. From alert handoff to retrospectives and everything in between, things like, you know, tracking, communicating, reporting: all the stuff no one cares about. FireHydrant will automate processes for you, so you can focus on resolution. Visit firehydrant.io to get your team started today, and tell them I sent you because I love watching people wince in pain.


Pete: Yeah, you often hear these terms, “It’s a governance solution.” It’s a—


Jesse: Yeah.


Pete: —kind of a gross term. Yeah, it doesn’t sound good. Sounds very enterprise-y, ‘governance.’ But that’s really what it is. You’re trying to enforce this policy in what I would call either the stick or the carrot model.

You’re either going to give some negative feedback, like terminating that instance or maybe some positive feedback, positive reinforcement, by emailing that person and saying, “Hey, just a friendly reminder, I’m going to terminate your instance.”


Jesse: [laugh].


Pete: But tools like Cloud Custodian are really powerful because they can help you keep an eye on things when you can’t. You can’t always watch everything that’s happening—and hopefully, you’re allowing your engineers the ability to build it and run it themselves on Amazon—so a governance solution will ensure that people are doing the right things and, I think this is the important part, fixing it when they’re not.


Jesse: Yeah, again, this gets back to the idea that you want your systems to be as streamlined as possible, as easy to use as possible for your developers: make the right thing the easy thing. So, for example, if you want all these tags applied to your resources, you want your deployment pipelines to be able to add these tags as easily as possible. But someone’s going to forget; someone’s going to log into production and spin up that i3 instance for whatever, quote-unquote, “testing purposes,” but then you now have a streamlined way to very clearly say, “Hi. I know that this was probably meant to be spun up in a different environment, or maybe you meant to tear this down as soon as you were done testing with it, but this isn’t compliant with our tagging policies; this isn’t compliant with maybe our business policies. This resource either needs to be moved or tagged; some action needs to take place.”

So, there are really awesome opportunities to provide very streamlined, easy-to-understand messages to your engineers to say, “Hey, this doesn’t align with our tagging policies. This doesn’t align with our business policies. We need you to take some action.” And very clearly give them the action to take, whether that is adding tags or tearing it down. Give them very, very easy opportunities to understand how to resolve the problem and move forward.


Pete: Yeah, exactly. I mean, there are really two main strategies here: you set up continuous monitoring, which you’re probably doing already with host monitoring and metrics monitoring, but set up continuous monitoring with a governance solution—either home-built or maybe Cloud Custodian or other commercial SaaS products that exist out there—and notify those teams when they fall out of compliance. But then also start introducing the more aggressive approaches, in pre-production accounts first, maybe: terminating those resources over time, or automatically resizing them if someone used the wrong instance, or whatever. I mean, this stuff exists via these APIs. And you could do both of these. ¿Por qué no los dos? Why not both?


Jesse: [laugh]. I will say a lot of the research that we’ve seen suggests that positive reinforcement incentives work better than negative reinforcement. So, in a lot of cases, if you can champion the teams that are tagging all of their resources—or sorry, I should say tagging all of their taggable usage—that’s ultimately going to lead other teams to say, “Well, hey, if they’re getting appreciated for their work, if they’re getting rewards for tagging all of their usage, I want to be like them. I want to tag my usage as well.” And that’s ultimately going to be more impactful, more beneficial, than punishing teams for not tagging their resources. But to be clear, there are definitely going to be teams that just don’t care. So, you’re going to need a little bit of both. There are going to need to be some opportunities for tough love.


Pete: Yeah. I mean, I would even consider providing cash incentives, right? Let’s say you’re identifying, via tagging, that some teams have a higher percentage of tagged spend or are maintaining their spend better than others, and other teams are maybe a little bit more wasteful or leaving things running. Provide cash incentives for those teams to take action. And this kind of goes back into a broader discussion when you talk about tags and how you’ve maybe used them for your budgeting purposes, whether it’s a chargeback or a showback.

But make those numbers a thing, those budgets that teams can work toward—again, another metric you can track—and consider savings. If you see an opportunity for a team to reduce their spend, incentivize them with some money. That’s a great incentive for a lot of teams.

Jesse: Yeah. Or if you’re able to, maybe you can incentivize them with additional days off, additional PTO days, or something else, maybe a group activity—and I shuddered as soon as I said those words—but maybe there’s something that you can do to incentivize them that really speaks to the team’s interests.


Pete: Yeah, you could do that. You could also do what I used to do at a previous company, which was to just secretly replace everyone’s hosts with T-class instances.


Jesse: [laugh].

Pete: I’m not even joking. It’s like, “[velvety-soft voice] We’ve secretly replaced his host with a T2. Let’s see if he notices.”


Jesse: [laugh].


Pete: Spoiler alert: they did notice, but very rarely did we ever have to move it back. We used our monitoring, and our monitoring of performance was our biggest cost management tool for identifying those savings. Now, Amazon has resource groups, you can use AWS Config managed rules, or Cloud Custodian, like Jesse mentioned. And you can even set up some interesting CloudWatch event-driven Lambda functions that can go and apply tags, or identify things that are missing tags, after the fact. There are just countless tools out there that can really help you with this.
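That last pattern, an event-driven Lambda that tags things after the fact, might look something like this minimal sketch: a handler behind an EventBridge rule matching RunInstances CloudTrail events, tagging new instances with whoever launched them. The “created-by” key is a hypothetical choice:

```python
# Minimal sketch of an event-driven auto-tagger: triggered by an
# EventBridge rule on RunInstances events delivered via CloudTrail.
import boto3

ec2 = boto3.client("ec2")

def handler(event, context):
    detail = event["detail"]
    creator = detail["userIdentity"]["arn"]
    instance_ids = [
        item["instanceId"]
        for item in detail["responseElements"]["instancesSet"]["items"]
    ]
    ec2.create_tags(
        Resources=instance_ids,
        Tags=[{"Key": "created-by", "Value": creator}],  # hypothetical key
    )
```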


And always, again, remember that this is continual; it’s always happening. If you can make it a little bit better every day, two, three years from now you’re going to look back and realize, “Wow, we are much further than we were when we started because we did a little bit at a time.”

Jesse: And then this ultimately allows you to make data-driven decisions based on all of this information. Whether, as Pete mentioned before, that’s forecasting, whether that’s a security or a compliance discussion, whether that’s a cost discussion, you now have all of this historical data for your AWS usage that you can leverage to really understand what you might ultimately be capable of in the future.

Pete: Exactly. So, if you’ve enjoyed this podcast, please go to lastweekinaws.com/review and give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review and give it a five-star rating on your podcast platform of choice, and tell us how many t2.micros you deployed for your engineers. Again, don’t forget, please give us your feedback. Any questions as well, you can go to lastweekinaws.com/QA. If you have questions about some of these things or want some additional insights, we’re going to be taking some time in future episodes to talk more about those. Thanks again, everyone.

Announcer: This has been a HumblePod production. Stay humble.

Fri, 02 Apr 2021 03:00:00 -0700
You Can't Trust Amazon When It Feels Threatened


Want to give your ears a break and read this as an article? You’re looking for this link.


Never miss an episode

Help the show



What's Corey up to?

Wed, 31 Mar 2021 03:00:00 -0700
AWS FaceHugger Now Integrates With AWS ChestBurster
AWS Morning Brief for the week of March 29, 2021with Corey Quinn.
Mon, 29 Mar 2021 03:00:00 -0700
Why Are You Still Paying Retail Prices?!

Links:


Transcript

Corey: This episode is sponsored in part by LaunchDarkly. Take a look at what it takes to get your code into production. I’m going to just guess that it’s awful because it’s always awful. No one loves their deployment process. What if launching new features didn’t require you to do a full-on code and possibly infrastructure deploy? What if you could test on a small subset of users and then roll it back immediately if results aren’t what you expect? LaunchDarkly does exactly this. To learn more, visit launchdarkly.com and tell them Corey sent you, and watch for the wince.


Pete: Hello, and welcome to the AWS Morning Brief. I am Pete Cheslock.

Jesse: I’m Jesse DeRose.


Pete: Fridays From the Field. Triple F.


Jesse: Wooo.


Pete: It’s going to be a thing. We’re working on it. And you can follow along with this Unconventional Guide by going to duckbillgroup.com. On the website, you can download this entire Unconventional Guide as a handy PDF. We’ll include the link in our [show notes 00:00:33]. It’s a really long link that I’m not going to read out here.


Jesse: Is it wrong that I want Rebecca Black’s “Friday” to be our opening intro music now?

Pete: Oh, yeah. That would be, actually, pretty good. I feel like the cost of licensing that might be a little higher than we want to bear. But I don’t know, maybe there’s some sort of fair use thing that we could do with it.

Jesse: I like it. We’ll think about it.

Pete: Well, you know what? We can all just sing it in our heads. And that’s a good way to get it—

Jesse: [laugh].


Pete: —very cost-effective way.

Jesse: We know that you’re groaning as much as we’re groaning, and that’s what’s important.


Pete: That is very true. So, today, we are talking about why are you still paying retail prices for your Amazon usage? And maybe you’re sitting there going, “Well, what else would I pay?” Well, you’d pay less than that, right?


Jesse: Yeah. Last week, we talked about reservations: Savings Plans and reserved instances. And that’s really important, but today we’re talking about something a little bit different. Reservations are still important and still, potentially, part of this conversation, but it’s possible to not pay retail prices. You have to think about it in the same way that you think about reservations: you have to be willing to make investments into your cloud spend, into your cloud usage.


Pete: So, we mentioned this in a previous episode: no matter how much your spend is, from a couple of dollars a month all the way up to hundreds of millions of dollars a month, you have an account manager with AWS. You may have never met them, but there is someone that is specifically assigned to you. And the reason for this is that every big-spending client out there starts as a small-spending client. If you’re a startup, you might be spending $10,000 a month. That can be a huge amount of money for your business, but Amazon knows that next year, you’re probably going to spend more than that. And so everyone gets an account manager, and that account management team is there to help you improve your bill.

And by that I mean, help you spend less when it’s possible. So, the way they do this is by investing in this relationship. They want you to save money. And I’m not making a funny here; that may sound like a very strange statement, but Amazon doesn’t want you to spend your money wastefully. That makes for angry customers. Right, Jesse?


Jesse: Yeah, this is ultimately something that I see come up again and again. AWS’s account management team really wants to help you; their job is literally to help you. This relationship is super, super important, and can manifest in a number of different ways: it can manifest in your account manager trying to set you up with a solutions architect or technical account manager to use more AWS services; it can be talking about some of the discounts that we’re going to talk about today; it could be a whole slew of things, maybe credits to move or migrate from your data center into AWS. That’s one we’ve seen a couple of times with a couple of different clients of ours.

Pete: Yes, specifically, we’re talking about one of the most well-known, I guess, of all of the discount programs inside of Amazon, called the Enterprise Discount Program. This is often referred to as an EDP. This is actually separate from something called an Enterprise Agreement, which is just, I believe, some shared legal agreements about how you will operate on the platform. This is broader than that. This is both Amazon and your business committing to certain terms—so legal is going to get involved; there are going to be some legal requirements—but at the end of the day, this is how you can get a discount on your spend: just a straight, broad, cross-service discount that applies to all of your spend. For the most part. I say ‘all,’ but really a majority of your spend within Amazon.

Jesse: So, now you’re thinking to yourself, “Fantastic. How do I sign up? Shut up and take my money.” Well, there are levels to this. We’ve usually seen this with clients or AWS customers whose spend exceeds $1 million per year. That’s usually the sweet spot where your account manager will step in and say, “Hey. Hello. Hi, how are you?”


Pete: Yeah. That’s where you get the introduction because at that spend, yeah, okay, you’re at—what—$100,000 a month, at least? Six figures a month, that’s real spend. That’s real spend that’s probably not going to go away anytime soon. And it’s spend that probably is going to increase in the coming years.


Jesse: And even if you’re not at $1 million per year, you can still start that conversation with your account manager today. They can still tell you what levers you have in order to become part of this EDP program, what levers you have to start getting discounts on your usage today.

Pete: So, something we do a lot: we actually help a lot of our clients through this negotiation process, holding their hand and negotiating on their behalf to improve their discounts. And a good number of our clients will actually preemptively negotiate these contracts in advance of their spend growing on Amazon, basically making these multi-year commitments because maybe they’ve just closed a deal with a large customer, they’re expecting some future growth, and they want to make sure that they can get the biggest discount possible. That’s what an EDP is: basically, you’re saying, “I commit to spending a certain amount of money per year, and in exchange, I will get a discount.” Now, there are a lot of nuances here, but the key thing is the commitment—let’s say, “I will spend $10 million a year for the next three years.” If I spend less than that, I am going to be on the hook for the difference, for the shortfall. And they do this for multiple reasons. Obviously, Amazon is a public company. They want to smooth out their revenues, and these commitments help them do that.
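A toy illustration of those commitment mechanics. The discount rate and true-up behavior here are made up, since actual EDP terms vary by contract:

```python
# Toy math only: real EDP discounts and shortfall true-ups are
# contract-specific.
annual_commit = 10_000_000   # hypothetical: "$10 million a year"
discount = 0.10              # hypothetical cross-service discount
actual_spend = 8_000_000     # what actually ran through the bill

billed = actual_spend * (1 - discount)
shortfall = max(0.0, annual_commit - actual_spend)  # on the hook for this

print(f"Discounted bill: ${billed:,.0f}")
print(f"Shortfall owed on top: ${shortfall:,.0f}")
```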

Jesse: Yeah, and I think one other thing that’s important to note is that AWS will likely come to you and say, “Okay, we’ve looked at your past six months of usage, and based on your past six months of usage, we expect you to grow in this way, make this kind of financial commitment in order to be part of this discount program.” Now, you may say, “Well, wait a minute. That doesn’t make sense with our business model, or that doesn’t flow with the fact that, like Pete mentioned, maybe we just signed a big customer and we are going to double our size in the next few months.” But you need to be able to show AWS why the data that they’re looking at may be wrong if you want to ultimately convince them that they should give you a different commit level than what they originally propose.

Pete: Yeah, and that can be a challenge. They’re going to want to see a certain level of growth over the next few years to give you a certain level of discount. You may have additional information that they may not have, so you’re definitely going to need a lot of research on your side: as we discussed in a lot of the previous episodes, a lot of insight into your spend, and clear forecasting that you can share with them. If your future forecasts say, well, our business is only going to see growth of about 5% year over year, and here is how our business growth is tied to Amazon spend, then your account manager will take that back to their internal teams and basically help them come up with a plan that works.


So, the more ammunition you have, the more powerful you are—I don’t want to call it negotiating leverage, because negotiating with them is weirdly rational. They’re kind of following a script, although, in many cases, I think a lot of our clients find that process a little rough.

Jesse: Yeah, I think that there are definitely levers that can be pulled in order to find different commitments and come to agreeable terms together, but to your point, Pete, it is scripted. The negotiation process is very fixed and regimented; your account manager is ultimately trying to give you the best deal based on the data that they see, but internally, there are only so many different things that they can optimize for. Maybe there are four levers, and if you could tell your account manager, “Hey, this one lever is the most important one for me,” then they can fight for that one. But if you say, “Well, I need all four of these levers. I need all four of these things to be discounted,” or, “I need all four of these things to be available to me,” it’s going to be a lot harder for them to get approval on your discount.


Pete: Yeah, and we actually get this question a lot, which is, “Well, if it is so regimented, if there is so little leeway in negotiating these EDPs, where do you help out when you do it for your clients?” And oftentimes, it’s helping them understand what is negotiable and what’s not, reducing the amount of time that’s wasted in the back and forth, but then also modeling out different discounts and different commitment levels. Because this is another scenario where you can give them some upfront payments for additional percentage points of discount, but what does that represent over time? Or there are additional discounts if you make a longer commitment. Talk with the engineering teams, talk with the financial teams. If you’re willing to commit to three years, I honestly think, with the exception of some rare clients that we work with, if you’re in for three, you might as well be in for five because you increase your discounts. And what’s going to happen in three years? Are you just going to be like, “Yeah, we’re just moving to GCP?” Even if you made that statement in three years, it’s going to take you two more to get off; it’s going to—

Jesse: Absolutely.

Pete: —take a long time to actually do a migration of cloud vendors.


Jesse: And I think this gets back to one of the points we talked about in the previous episode about reservations, in terms of the breakeven point. Essentially, last time we talked about what’s the breakeven point for your reservations: are your workloads going to be around long enough that the discount is worth the upfront commitment? The same thing applies here. If you know that you’re going to remain on AWS for this amount of time, if you know that you are going to hit certain commit levels in AWS spend over a certain amount of time, then it is absolutely worthwhile to consider this program; it’s absolutely worthwhile to negotiate a contract to get Private Pricing.


Pete: I think another point, too: as folks start looking at whether to make these commitments to Amazon and start that negotiation process, you’re going to really want to—and if not you, your executives are going to really want to—say, “Well, just tell Amazon that if they don’t give us a good discount, we’re going to go to GCP.”


Jesse: Yeah.


Pete: And I urge you—don’t ever say those words out loud to anyone.

Jesse: Please. Please don’t.


Pete: Because the reality is that the folks at Amazon know more than your executives do about who actually leaves Amazon, and the number of people that do that is pretty small. Now, if you were to threaten to say, “Oh, we’re going to move to Google,” or, “We’re going to move to Azure,” or Oracle, or whatever: if you can show yourself doing that, I would imagine that that is a handy negotiating tactic. Meaning, can you shift a majority of your workload to another vendor within the course of, like, a week—some clients we’ve seen can actually do that—and that has a bit more impact on the negotiating process than just saying, “Well, we’re going to up and move,” and ignoring the fact that you’ve built this bespoke Kinesis-Lambda-event-driven infrastructure that has zero portability whatsoever.

Jesse: Yeah, this definitely gets back into the argument of, “Well, we don’t want to be locked into any cloud vendor.” But you kind of already are.

Pete: Yeah. I mean, outside of whatever your architecture is, you have all these engineers who you’ve been, hopefully, building up from a technology perspective, who are all experts on Amazon. So, this all goes back to: in three years, if your business were to say, “We’re going to up and move somewhere else,” you totally have the ability to do that, but it’s going to take you years, and you’re probably going to lose most of your engineers if you decided to just one day move to Azure. So, again, if you’re in for three, you’re probably in for five, so definitely consider that as well.


Corey: This episode is brought to you in part by our friends at FireHydrant where they want to help you master the mayhem. What does that mean? Well, they’re an incident management platform founded by SREs who couldn’t find the tools they wanted, so they built one. Sounds easy enough. No one’s ever tried that before. Except they’re good at it. Their platform allows teams to create consistency for the entire incident response lifecycle so that your team can focus on fighting fires faster. From alert handoff to retrospectives and everything in between, things like, you know, tracking, communicating, reporting: all the stuff no one cares about. FireHydrant will automate processes for you, so you can focus on resolution. Visit firehydrant.io to get your team started today, and tell them I sent you because I love watching people wince in pain.

Jesse: Now, with all of this said, there are definitely lots of different opportunities for Private Pricing within this space. And we’ve talked a little bit about the different levers that can be pulled. Some of that is Private Pricing for specific services. So, for example, if you know that you are very storage-heavy—you use a lot of S3—maybe you want to focus on Private Pricing for S3, on some deep discounts for your S3 storage. Or if you are very compute-focused, you want to focus on discounts for EC2. Or if you are very focused in other areas, you want to focus on Private Pricing for those things.

But again, this comes back to the conversation of what levers are most important for you. Understand that there are definitely a lot of different opportunities to get Private Pricing with a lot of different services. And unfortunately, AWS doesn’t really explain what all of those are upfront. This is one of the things that we’ve found frustrating, for clients and for ourselves. There are a lot of different things that we’ve seen in terms of baseline Private Pricing, but there’s definitely more that we haven’t seen, and it’s not as clearly documented as we would like it to be.


Pete: And this actually is a great segue into the other type of big discount program. So, you have the cross-service discount of an EDP; within that EDP you might be able to negotiate, like Jesse mentioned, specific per-service discounts. Maybe you’re storing 10 petabytes of data on S3, and if you’re paying retail price for that, that is quite expensive. What you can do is commit to continuing to use a certain amount of S3 over a period of time (it’s a usage commitment), and they will extend you discounts for that.


These are often called a Private Pricing Addendum. Sometimes the EDP is called a PPA; there seems to be no solid terminology on what these are. But the most important part is that we normally consider an EDP a multi-year commitment to a certain spend amount, and inside of it might be some per-service discounts, but broadly, there’s a cross-service discount on the top. Outside of that whole EDP process—even without an Enterprise Discount Program—you can get a Private Pricing Addendum, a PPA. Let’s say you’re using quite a lot of data transfer of a certain type, or a lot of CloudFront, and maybe not a lot of other things.

Again, you can make those usage commitments and say, “I’m going to commit to sending a certain amount of data through this service over the next year. And in exchange for that, I’ll get a discount.” And what’s interesting is that the commitments can feel difficult to get across to financial teams, because you’re going to commit to a certain usage, which means you’re going to commit to a certain spend. It is a commitment, but it’s a discounted commitment. And when you start figuring this out, let’s say you were able to get a 50—five-zero—percent discount on a certain service that you’re going to commit to: your usage would have to drop by half before you were paying on-demand rates again.

I’ve done that in the past, where I’ve looked at it and been like, “Well, I might be able to make some optimizations to reduce my network data transfer. Do I really want to make this commitment?” But the reality is that I would have to reduce my usage by the discount amount just to be back to paying retail prices again. So, you’ve got to think about those breakeven points, essentially, just like you would think about your reservations, as we talked about in the previous episode: where are the breakevens?
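Pete’s breakeven math, worked through with made-up numbers: a hypothetical per-gigabyte rate and commitment at that 50% discount:

```python
# Worked example of the usage-commitment breakeven: at a 50% discount,
# usage has to fall below half the commitment before on demand wins.
retail_rate = 0.09          # hypothetical $/GB data transfer rate
committed_gb = 1_000_000    # hypothetical monthly usage commitment
discount = 0.50

committed_cost = committed_gb * retail_rate * (1 - discount)

for actual_gb in (1_000_000, 600_000, 500_000, 400_000):
    on_demand_cost = actual_gb * retail_rate
    winner = "commitment wins" if committed_cost <= on_demand_cost else "on demand wins"
    print(f"{actual_gb:>9,} GB: commit ${committed_cost:,.0f} "
          f"vs on demand ${on_demand_cost:,.0f} -> {winner}")
```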

Jesse: Yeah. So, there’s definitely work to do here, unfortunately. We are sending you home with homework; I’m so sorry. But it’s homework that is worth doing because, again, the more data that you can show to finance, to leadership to sign off on making these kinds of investments, and the more data you can bring to AWS to show why you should be getting the kinds of discounts that you want, the better discounts you’re going to get, the less you’ll end up spending on your AWS bill.

Pete: Yeah, that’s important, too. You’re not, again, going to negotiate for a certain price. Like, eh, maybe a little bit; it might swing here and there for some services, might be different for other services. Largely, there are tiers, usage tiers, and you’re going to fall within some of those tiers. But oftentimes, too, I find that I might see a client with a large amount of spend in a service still—again—paying retail prices. And if your account manager is worth their salt, and you go to them and say, “Hey. Is there Private Pricing for AWS Systems Manager Login Manager”—whatever random service is out there—their response should hopefully be, “You know, I don’t know, but let me go find out.”

Because if, again, your usage is at a substantial amount and you’re willing to commit to that usage, service teams like those commitments; it’s obviously good for so many different reasons, and if you can make those commitments, just like you would make a reservation commitment or a multi-year EDP commitment, there are discounts that can come along with that.

Jesse: Yeah. So, start that conversation with your engineering teams internally to look at these usage numbers, and start that conversation with your account manager today. Start reaching out to ask them, “Hey, we’re seeing a lot of spend here. Are there opportunities for discounts?” Now, also keep in mind that if you haven’t already purchased savings plans and reserved instances, that might be the first place your account manager suggests investing some money and some time, but past that, there are absolutely opportunities for other discounts.


Pete: Yeah. And again, your account management team isn’t out to get you throughout any part of this process. They’re oftentimes just looking out for your best interest. And the key thing is that when the EDP is signed and done, you still have that relationship with them, that professional relationship over the many months or years that they may be assigned to your team. So, they’re not trying to make this a difficult thing.


They’re hopefully trying to let you know what is reasonable and what’s not. But even inside of Amazon, there are just too many places to know where to go and what discounts exist; they don’t really have a good process around this. And so asking these questions, having them chase this stuff down, is the biggest thing that you can do internally. And I wonder if Amazon account managers are like, “Oh, God, Pete and Jesse. You’ve just given me all this additional work.”

Jesse: You’re welcome.

Pete: Well, it is Amazon review period right now, so if I did have the ability to give some account managers a positive review, I absolutely would.

Jesse: Yes.

Pete: And you know what? I will say this: you most often can give your account manager positive feedback. They should be including a link in their emails. This is Amazon we’re talking about. This place operates on data-driven feedback.

If your account manager goes above and beyond for you, click that little link and say something nice to them. That is directly impactful to their livelihoods. And if they are going above and beyond—which they definitely don’t need to—you want them to continue on at that business.


Jesse: Yeah, we’ve definitely seen account managers and technical account managers who have gone above and beyond and have created great, great relationships with their AWS customers. And that is so critical to an ongoing healthy relationship using AWS.

Pete: Finally, if you’re thinking about using a new service within Amazon, ask if there are any investment credits to help you adopt that service. As a SaaS provider, I’ve even seen certain types of credits when we wanted to do a large-scale proof-of-concept for a new client. Amazon, they’re not dumb. They know new clients mean more servers. So, they will very often give you some credits.

If you say, “Hey, we have this client. We want to give them a three-month POC; it’s going to cost us this amount of money; here’s the business case,” they will probably get you those credits. They know what that means. But again, they’re not going to do this stuff if you don’t ask. And of course, sadly, for a lot of these things—they are getting better at communicating—the information just isn’t widely out there.


And then finally, if this is all just super-complex, reach out to us at the Duckbill Group; we negotiate more EDPs than most Amazon account managers do, and we can really help you gain some confidence in that process. You can also go, like I said, to the Duckbill Group site; we have this Unconventional Guide up on the site. We’ll include a link in the [show notes 00:21:25]. And yeah, if you have any questions or feedback about this episode, about an EDP or a PPA, go to lastweekinaws.com/QA. We would love to answer those questions. Feel free to put your name in, or just leave it blank. That’s totally fine too.

So, if you’ve enjoyed this podcast, please go to lastweekinaws.com/review and give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review, give it a five-star review, and tell us what is on your wishlist for your next Private Pricing or EDP negotiation. Thanks.


Announcer: This has been a HumblePod production. Stay humble.

Fri, 26 Mar 2021 03:00:00 -0700
Sell Me an AWS Service, But Crappier

Want to give your ears a break and read this as an article? You’re looking for this link.



Never miss an episode



Help the show



What's Corey up to?

Wed, 24 Mar 2021 03:00:00 -0700
$500 Million in Request Charges Isn't Really a Request
AWS Morning Brief for the week of March 22, 2021 with Corey Quinn.
Mon, 22 Mar 2021 03:00:00 -0700
I'm Sorry, Do You Have a Reservation?


Links:


Transcript


Corey: This episode is sponsored in part by LaunchDarkly. Take a look at what it takes to get your code into production. I’m going to just guess that it’s awful because it’s always awful. No one loves their deployment process. What if launching new features didn’t require you to do a full-on code and possibly infrastructure deploy? What if you could test on a small subset of users and then roll it back immediately if results aren’t what you expect? LaunchDarkly does exactly this. To learn more, visit launchdarkly.com and tell them Corey sent you, and watch for the wince.

Pete: Hello, and welcome to the AWS Morning Brief: Fridays From the Field. I’m Pete Cheslock.

Jesse: I’m Jesse DeRose.

Pete: We’re back again. We’re continuing the Unconventional Guide to AWS Cost Savings. What are we talking about this week, Jesse?

Jesse: This one’s actually one of my favorite topics. I feel like I say that every episode, but they’re all my favorite topics; just don’t tell any of them that. This week, we are talking about investing in your future. We’re talking about making investments in the AWS platform in terms of reservations.


Pete: Awesome, yeah. I mean, there’s usually a return on investment. But investments are a complicated part. I mean, there’s a lot of different ways that Amazon is happy to take your money, right?

Jesse: Yeah, absolutely. And I feel like this is one that people are aware of tangentially, but I don’t think a lot of people think about regularly. I really wish more folks would make a habit out of regularly looking at usage and looking at the potential for reservations. Because as you said, Pete, there are amazing opportunities to receive a return on that investment, and I don’t think enough companies are taking advantage of that.

Pete: Yeah, there are a lot of nuances, and we’ll dive into all those things. But before we get started, just a reminder to all of our listeners that you can head over to the Duckbill site and download this Unconventional Guide; we have it as a handy PDF for review. It’s obviously going to cover some of the future episodes as well, so you get a little bit of a sneak peek there.

Jesse: Spoilers.

Pete: But if you do better with a written format, it is available. I would read the link off but it’s comically long and figuring out short URLs, we just haven’t reached that level of technical ability over here. So, we’ll include the link to that PDF in our [show notes 00:02:01], and you can go check it out at duckbillgroup.com. But also to go, too, lastweekinaws.com/QA and ask us questions. Send us your questions, your thoughts, your comments, your feelings. As someone I used to know a long time ago, your bitches, moans, groans, and complaints, just add them all in there. And you can add your name; you don’t have to, you can just send it anonymously. But ask your questions. We’ll be taking some time in future episodes to go into those questions and dive in deeper on some of these particular topics that people might be a little confused by or maybe just want some more insight into.

Jesse: Yeah, we’ve gotten some great questions so far that we are planning on future episodes for, and please keep the questions coming. There’s some really, really great questions, really, really great commentary in there. And we absolutely want to make this an engaging conversation. We want this to be a two-way conversation.


Pete: Absolutely. So, diving into investments: I’d have to go online and do some research, but I’m pretty sure EC2 instance reservations were probably the first type of commitment that you could make to Amazon. And again, if I’m wrong, folks out there listening, please go to lastweekinaws.com/QA and let me know. Or you could just tweet me as well at @petecheslock. That’s what most people do when I’m wrong: just tweet at me. Right, Jesse?

Jesse: Yeah. I mean, well, I have a direct connection to you, but if I didn’t, I’ll just tweet at you.

Pete: Yeah, you’ll just tweet at me or Slack DM me or whatever; send me a Zoom message, or maybe hit me up on Chime.

Jesse: Oh, god, yes. If somebody is hitting you up on Chime, you know you’re in trouble.


Pete: That’s very true. [laugh]. Something has gone wrong if I get a message on Chime. But what’s interesting is that instance reservations were a way of ensuring capacity: you could basically commit to running an instance in an availability zone in a certain region, and that instance would be there for you. It was a capacity reservation, which is actually something different now, which we might touch on later, but it wasn’t really a, “Give me a discount.” That came later.

It was an instance reservation: reserve this instance. And this was important because for those folks who have been part of Amazon in the earlier days, there were times that you would ask for a certain instance type in a certain availability zone and Amazon would kindly tell you to go pound sand because they didn’t have one of those for you.


Jesse: Yeah, this is something that we’ve seen with a number of clients who are largely multiregional and leveraging basically every instance type you can think of under the sun, and really putting all of these compute resources to their limits. So, getting some kind of confirmation that they would have this capacity available is kind of important.


Pete: Exactly. I remember specifically—this was yeah, maybe 2010 timeframe, kind of the heyday, the wild times of Amazon—we had been running—a company of mine had been running a sizable NFS cluster on EC2. “Why would you do that Pete? That’s a terrible idea.” Of course it’s a terrible idea.


We didn’t do it by design; we did it because we were a startup, and that was a proof of concept that got out of control, like most technology, right? But when we lost the NFS server itself, we had—I can’t even tell you how many—let’s say 50 EBS volumes that were all striped to this server because that’s a great idea. And we needed another server in that availability zone. We’re not going to snapshot, like, 50 terabytes of EBS. I don’t even know if that capability existed then, to move snapshots across availability zones.

So, we needed another instance, and luckily we had a great relationship with our account team—because we were so early—that I do remember, specifically, we got through to the right people. And the line was essentially, “You need to make this API call in the next 15 minutes, or you’re going to lose the instance that we’re basically setting aside for you.” [laugh].


Jesse: [laugh]. That is the best layaway plan I have ever heard.


Pete: I mean, it’s been a decade now, and I still chuckle at that one, just imagining what they had to do to actually get us that instance. But the reservation would have reserved that instance, and as a benefit, you would have gotten a discount for it. And a little bit later (I think it was maybe a year later, a few years later; hard to know exactly, since time has no meaning anymore, especially in COVID times), instance reservations kind of moved away from being capacity reservations. When you made one of the original instance reservations, you were making it in a specific AZ, and oftentimes you’ll now hear these referred to as a ‘zonal reservation.’


Like, “I want a c5.large in us-east-1a, and that’s for the next year.” But then a new type of reservation came out that was more about saving money. It was, “I want a c5.large in us-east-1, and I want the flexibility of running it in any availability zone.” And so you wouldn’t get a capacity reservation for that, but you would get a discount. And that was that first type of pure savings commitment: if I commit to running this instance 24/7 for the next 12 months, they will extend me a discount. Maybe not as good as the zonal reservation, but a lot better than retail pricing.

Jesse: Yeah. And I think that’s the ultimate idea here, you know, flash forward to today: AWS wants you to be happy with their service, they want you to use their service, they want you to engage with their service, and obviously, in some cases, they want you to give them more money. But the best way that you can prove to them that you are going to continue using their service for, let’s say, the next year (or the next three years, if you are really secure in your business plans) is to purchase one of these reservations to show AWS, “Hey, I know that I am going to spend this much money on compute services,” whether that’s on EC2, or RDS, or whatever other reservation types are available today. And therefore AWS will say, “Okay, great. That’s awesome. We really appreciate your business, we appreciate your dedication, we appreciate you being a continuing customer. In exchange, here’s a discount.” And like you said, Pete, a flexible reservation might get a little bit less of a discount than a zonal one, but there are varying levels there of, “Hey, we want to show our appreciation for you being a continued AWS customer.”

Pete: Yeah. Essentially, what it breaks down to is that the larger, the longer, and the more specific your commitment can be, the greater your discount. And when you say it that way, you’re going to say, “Well, that’s obvious.”

Jesse: Yeah.


Pete: But you would be surprised how many people don’t take the time to think through that. Because we all have a million other things going on; who’s going to obsess about instance reservations, and what’s the best discount? Just really us, right?

Jesse: [laugh]. Yeah. Well, and I think it’s also important to note that with a lot of these reservations, you’re potentially talking about a lot of money that ends up on the bill for finance, or for someone in leadership to approve, and just at a glance, those numbers can be really scary. Even if it’s numbers that they’re used to seeing across the business in general, they still want to understand why we are spending so much money on AWS all of a sudden. And you need to be able to have that conversation to explain, “Okay, well, we may be spending money upfront, but think about how this cost is going to be amortized over the course of this reservation.”

Whether that’s a year, whether that’s three years, and think about how much benefit we’re going to get; think about the return on investment. So, there’s all these little things that you can use to have this conversation with finance, with leadership, to make sure that they understand that this isn’t just about spending a ton of money right now, it’s about making a really solid investment for your cloud infrastructure long term.

Pete: Yeah, I think that topic ties into an important point when deciding to purchase a reservation. A practice that I followed pretty religiously at previous companies: very diligently, month by month, I would review instances that were not under a reservation. And if I could commit to running that, if I knew that service, that instance, was probably going to be around in its current state for at least eight or nine months, I’d buy the reservation. Because what I was doing was calculating the breakeven point. If the discount is about 30 to 40%, then you can figure out when your breakeven time is. And if the instance runs full-time up until that breakeven point, you’re doing no worse than doing nothing.

Jesse: Yeah.


Pete: It’s on-demand pricing up to that point. And the way I would kind of rationalize it in my head was, “Yeah, we’re just going to pay the same price as we are now, and then every day after the breakeven, it’s free.” Which is another way of breaking that out; it’s just how you think of it. But calculating the breakeven point is something that I don’t think a lot of folks out there are doing in—

Jesse: No.

Pete: —I think more often, they just decide that making that reservation is too big of a commitment versus, “Well, doing nothing is one type of commitment, and making a decision is another type of commitment.” What’s the trade-off?

Jesse: Yeah, and we’ve talked a lot about gathering data, making data-driven decisions about your cloud costs, and I think this is another one. It’s really important to understand, hey, is this microservice, is this product going to be around eight to nine months from now minimum? If so, it’s absolutely worth making this investment. Or I should at least put the caveat of, is this microservice or this product going to be around in its current architecture state? Because in a lot of cases, I’ve seen folks who say, “Yes, this service is going to be around.”

And then they end up purchasing reserved compute capacity, but then they move to newer compute instances, or they move to a slightly different infrastructure model that doesn’t use the same resources, so they end up not leveraging the full impact of that reservation. So, all these are little things to think about as you are thinking about how best to optimize your spend and really invest in your cloud spend.
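The breakeven math Pete walked through, with hypothetical numbers for a one-year, all-upfront reservation at a 30% discount:

```python
# Worked example: a one-year, all-upfront reservation at a hypothetical
# 30% discount breaks even at 8.4 months, which is why Pete's rule of
# thumb is "will this instance be around for eight or nine months?"
on_demand_monthly = 100.0   # hypothetical on-demand cost, $/month
discount = 0.30
reservation_cost = on_demand_monthly * 12 * (1 - discount)  # $840 upfront

breakeven_months = reservation_cost / on_demand_monthly
print(f"Breakeven at {breakeven_months:.1f} months")

# Keep the instance longer than that and every extra month is effectively
# free versus on demand; kill it sooner and retail would have been cheaper.
```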

Corey: This episode is sponsored by ExtraHop. ExtraHop provides threat detection and response for the Enterprise (not the starship). On-prem security doesn’t translate well to cloud or multi-cloud environments, and that’s not even counting IoT. ExtraHop automatically discovers everything inside the perimeter, including your cloud workloads and IoT devices, detects these threats up to 35 percent faster, and helps you act immediately. Ask for a free trial of detection and response for AWS today at extrahop.com/trial.

Pete: Yeah, and everything has a cost with Amazon. Doing nothing has a cost; doing a lot of things has a cost. So, you have to balance that out as well, and break it out into how much time you want to spend to answer these questions. You can spend a lot of time managing all of your reservations yourself and buy zonal instance reservations, but the overhead, the time commitment, and the software required to do that well without waste is high. On the flip side, you can get less discount but more flexibility—and that’s the trade-off that we’re usually discussing: maybe it’s less flexible with a greater discount, or more flexible with worse discounts.

And that’s where Savings Plans come in—that’s the new service that came out, what, about a year ago? Maybe over a year ago—where you can just make a per-hour commitment. You’re running $100 per hour of on-demand spend; commit to a portion of that and we will provide you a discount—“we” being Amazon. Obviously not Duckbill; we can’t provide you any discounts other than telling you all those places that you can save money in your bill. But the Savings Plan is simplicity in a nutshell. It’s just per hour.

And it’s actually the only way to get reservation-style discounts on things like Lambda and Fargate; those don’t have their own reservation plans. So, if you’re kind of like, “Ah, I want some reservations, but I just don’t know what I’m going to be running six months from now,” then go buy a Savings Plan and move on with your day. And honestly, for a lot of companies out there who are really good at managing their reservations and convertible reservations and things of that nature: think about the time you’re spending on all of that overhead and calculate it in. What you may find is that the Savings Plan is actually a bigger savings because you can just buy it and then go back to your real job.

Jesse: Yeah. And I think it’s also important to note that Savings Plans will stack on top of each other. So, if you purchase a Savings Plan that’s maybe a little bit lower than the ideal recommended Savings Plan—recommended either by AWS or by a third party—you can always come back and revisit that Savings Plan purchase later, and purchase another Savings Plan to augment it. You can always say, “Hey, you know what? I thought that maybe X was too high, so I bought a Savings Plan for about half of X every hour, but now I see that I actually am spending X every hour.” So, you can purchase another Savings Plan to add to that, and ultimately they stack on top of each other.

Pete: Yeah. So, Amazon’s going to recommend some number that is the absolute most you should give them. And honestly, it’s a little crazy. It’s going to look back at your spend and say, “Well, based on your previous spend, give us $100 an hour for the next year.” And you’re like, “Well, that’s a lot. What if I only want to do half that?”

Now, they can’t answer that question. But interestingly enough, Duckbill has some interesting tools that can answer it for you—you should reach out to us if you’re trying to understand that question. But to Jesse’s point, if they’re recommending a certain commitment, you could commit to maybe 25% of that this quarter, then reassess next quarter and maybe make another 25% commitment. It’s kind of handling it the same way you might handle a reservation.

You might do a quarterly analysis of your reserved usage and adjust accordingly. Don’t feel like you need to make that one click and say, “Let’s buy the $100-an-hour commitment, one year, all upfront,” click the button, and the next thing you know, you’ve got this, like, million-dollar invoice payment.

Jesse: [shouting] YOLO.


Pete: It’s the most expensive API call.
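A quick sketch of the incremental alternative to that one-click YOLO purchase, with hypothetical numbers: layering smaller Savings Plans quarter by quarter as the baseline proves itself:

```python
# Sketch of incremental Savings Plan purchases; plans stack, so each
# quarter's commitment adds to the ones before it. Numbers are made up.
recommended_hourly = 100.0   # AWS's recommended commitment, $/hour
fraction_per_quarter = 0.25

committed = 0.0
for quarter in range(1, 5):
    committed += recommended_hourly * fraction_per_quarter
    print(f"After Q{quarter}: ${committed:.2f}/hour committed "
          f"of the ${recommended_hourly:.2f}/hour recommendation")
```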


Jesse: Yeah, absolutely. There are definitely opportunities to make small incremental investments in your future within this space. But I think another thing that’s worth calling out is that we’ve talked specifically about Savings Plans, and we talked a little bit about reserved instances as well—which fell out of the initial reservations that Pete was talking about—but there are other reservation options available within AWS that aren’t talked about as frequently. And usually, that’s because these are services that maybe aren’t used as frequently as, say, EC2 or RDS. But there are still other reservation opportunities within AWS, so it’s always worth looking to see: is there a way that I can invest in my usage of this particular AWS service? Is there an opportunity for me to get some kind of discount here?


Pete: Yeah, that’s a really good point. There are niche services within AWS that you may not realize you can actually create a reservation for. I’m not going to call DynamoDB one of those—I don’t think that’s a niche service—but you can reserve capacity in a DynamoDB setting to increase your discounts there. I’m talking about things that are really obscure, that you’re not going to know about unless you’re a user of them. Like, MediaConvert has reservations.


And MediaLive—streaming your media—you can make certain commitments—again, that’s what this is all about—commitments to Amazon for your usage and get some discounts there. So, check those out. And also, if you have an account manager—which you actually do; no matter what your spend is, you have an account manager. You may not know who they are; they may have never reached out to you. But your account manager can also help you identify some different places where reservations could happen, and even give you an idea of what kind of savings you could see, based on your particular services. So, even if you don’t see a reservation plan for a service you’re using at a high level, reach out to your account management team because they can be really helpful in finding out what savings might exist.
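
Both ends of that spectrum are queryable from code, too. A rough boto3 sketch, assuming Cost Explorer’s reservation recommendation API for the mainstream services and MediaLive’s own offerings API for the niche end:

```python
# Sketch: reservation opportunities beyond EC2 and RDS. Cost Explorer covers
# the mainstream services; niche services like MediaLive expose reservation
# offerings through their own APIs.
import boto3

# Mainstream: ask Cost Explorer for a reservation recommendation.
ce = boto3.client("ce")
rec = ce.get_reservation_purchase_recommendation(
    Service="Amazon ElastiCache",  # RDS, Redshift, etc. work the same way
    LookbackPeriodInDays="SIXTY_DAYS",
    TermInYears="ONE_YEAR",
    PaymentOption="NO_UPFRONT",
)
for r in rec.get("Recommendations", []):
    print(r.get("RecommendationSummary"))

# Niche: MediaLive publishes its purchasable reservation offerings directly.
medialive = boto3.client("medialive")
for offering in medialive.list_offerings().get("Offerings", []):
    print(offering)
```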


Jesse: Yeah, the key point to really remember in all of this is to start tracking your use of AWS services as they grow and stabilize over time, much like some of our previous conversations about starting with GP2 EBS volumes for your unknown workloads, or S3 standard storage for objects with unknown access patterns. Start measuring your usage today of these different AWS services, and when you start to see a stable baseline usage over time, invest in that usage to maximize your savings.


Pete: Absolutely. The more insight you have into your Amazon spend, the better you can understand future usage and plan capacity. But as soon as you have even the slightest notion that this is going to live for a period of time, make that commitment. The sooner you can make the commitment, the sooner you can get savings, and that’s the best way to save on Amazon: save the money before you’ve spent it, before you’ve gone down the path. There’s no retroactive savings. So again, the sooner you can make those commitments, the better it will be for you, your cash, and your bottom line. And it makes your C-levels, your CFO, really happy when you can explain to them the benefits of these reservations.


All right, if you’ve enjoyed this podcast, please go to lastweekinaws.com/review and give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review, give it a five-star rating on your podcast platform of choice, and tell us: what is your percentage of reserved instances? Is it up in the 90s? It should be. Don’t forget, we want to hear your questions: lastweekinaws.com/QA. Send us your questions, your thoughts, and your feedback. We’d love to address all of those in a future episode. Thanks again.


Announcer: This has been a HumblePod production. Stay humble.

Fri, 19 Mar 2021 03:00:00 -0700
The Future of Cloud is Microsoft's to Lose

Want to give your ears a break and read this as an article? You’re looking for this link.

Never miss an episode



Help the show



What's Corey up to?

Wed, 17 Mar 2021 03:00:00 -0700
Word-level Overconfidence
AWS Morning Brief for the week of March 15, 2021, with Corey Quinn.
Mon, 15 Mar 2021 03:00:00 -0700
Listener Questions 2

Links:


Transcript

Corey: This episode is sponsored in part by LaunchDarkly. Take a look at what it takes to get your code into production. I’m going to just guess that it’s awful because it’s always awful. No one loves their deployment process. What if launching new features didn’t require you to do a full-on code and possibly infrastructure deploy? What if you could test on a small subset of users and then roll it back immediately if results aren’t what you expect? LaunchDarkly does exactly this. To learn more, visit launchdarkly.com and tell them Corey sent you, and watch for the wince.

Corey: Ever notice how security tends to be one of those things that isn’t particularly welcoming to folks who don’t already have the word ‘security’ somewhere in their job title? Introducing our fix to that, Meanwhile in Security. To sign up for the newsletter or to find the podcast, visit meanwhileinsecurity.com. Coming soon, from The Duckbill Group.

Pete: Hello, and welcome to the AWS Morning Brief. This is Fridays From The Field, hashtag-triple-F. I am Pete Cheslock.

Jesse: I’m Jesse DeRose, and I have a question: is it hashtag-triple-F, or is it hashtag-F-F-F? Are we spelling out triple F in this hashtag, or is it just literally three Fs?


Pete: The three Fs is a little triggering for me, with my high school grades, so let’s just stick to—

Jesse: [laugh].


Pete: —hashtag—


Jesse: —triple-F.

Pete: Triple-F, I think, just has a better flow to it. But that’s a good—it’s a good point in our continued effort to make hashtag-triple-F a thing.


Jesse: All of our audience members were really concerned about that one because they’ve been trying to get us trending on Twitter, but they weren’t really sure: was it triple-F, or was it F-F-F, or was it something in between?

Pete: Exactly. It’s just bad. But we’re going to keep trying at it, and we’ll see what happens. Well, anyway, we are back again to continue our Unconventional Guide to Cost Optimization on AWS with another listener question. And unlike the last time we did listener questions, this question actually came in during the Unconventional Guide series, which means we actually have one listener this series. Because we can’t count the last one that was from way before. So, to this one listener, thank you, thank you for listening.

Jesse: Just that one listener. Just you. Thank you.

Pete: Yeah, just you. Everyone else, no, we’re not going to, we’re not going to thank you at all. But if you want to be our second listener, go to lastweekinaws.com/QA and give us a question. What do you want to know more about?

What can we dive into a lot deeper on any of these topics we’re talking about? It’s complex stuff, and we’re all learning this; we’re all trying to figure out what works best. And not every company is the same. And that’s what I actually love about this question, because it came in from someone who didn’t put their name—but that’s okay—they work in the public sector, which is why they didn’t put their name in there. And they had a pretty interesting question. So, Jesse, maybe you can read this off for us and let us know what we’re going to be answering today.

Jesse: Yeah. This question is, “We’re an Azure shop, partly cloud on the way, however, we’re also becoming an Oracle OCI shop”—I’m so sorry—“And an AWS shop, and well, it’s public sector, so one-of-everything cloud provider. How do we convince management that cloud is a different thing than on-prem and needs some kind of cloud team? I dislike the phrase DevOps as a job title, but we need something to change the current model where nearly all of this work is outsourced to a quote-unquote, ‘managed service provider?’” Oof. I have so many feelings.

Pete: I would imagine. I mean, I was immediately—I felt called out, you know? Just @ me next time, public sector coward with the DevOps-as-a-job-title phrase.

Jesse: Yeah.

Pete: They often say that—wait, what’s the term? It’s like, “Only a DevOps tool would give themselves DevOps as a job title.” Of course, that’s often said about me because I gave myself a title called ‘DevOps Director’ or ‘Director of DevOps.’ Either way you phrase it, it’s all pretty bad.

Jesse: Yeah. So, there’s a couple of different questions in this, and we’re going to dive into each of them individually. But really, really quick, I want to talk about multi-cloud because that’s kind of the underlying discussion here; something that is not necessarily the focus, but let’s talk about multi-cloud. Why is multi-cloud a thing? Why is it an important thing that you should be thinking about?

Pete: Multi-cloud is an interesting topic that could go a lot of different ways. And I consider multi-cloud a lot different from hybrid cloud. I think most people are probably doing hybrid cloud, meaning you’ve got some data centers—because it takes you years and years and years to move off of those—and you’ve also got cloud workloads, or maybe you’ve got some data centers and you’re bursting up to cloud workloads; that’s pretty cool, too. I think of multi-cloud as individual applications being deployed to a given cloud vendor or provider based on maybe price or features or things like that. And honestly, a lot of the cloud providers are getting closer in feature sets.

But for example, I might want to use Lambda, but I may not want to suffer the high cost of data transfer. So, can I build an application that leverages Lambda, but maybe also leverages the extremely low cost of Oracle’s OCI data transfer? That made the news when Zoom signed that big contract with Oracle; it was largely driven by network data transfer. So, there are some reasons why multi-cloud might be a thing.

Jesse: And we’ve definitely seen multi-cloud in practice with some of our clients. But I also want to call out the caveat that the clients that were doing this were very mature in their cloud cost practices. So, kudos to those clients because they’re doing amazing, amazing work. But it takes time to really build up a mature, scalable, optimized, multi-cloud strategy.

Pete: Yeah, exactly. And I think the biggest challenge that we see is, on the one hand, if you say to yourself, “I’m going multi-cloud, therefore I will only consume core primitives like compute, block store, object store, networking,” even though all the providers will give you those services, the APIs to interact with them will be wildly different. But most importantly, the authentication models are going to be wildly different; how you authenticate to each one of these is going to be all over the place. And that’s going to pose a pretty big challenge.

Jesse: Yeah. So, I think that ultimately gets into the first question that we want to focus on here, which is: how does developing and operating workloads in “the cloud”—quote-unquote—differ from an on-prem data center? And bonus points: how do the cloud providers differ from each other, which we’ll talk about a little bit here, too. Now, when you’re thinking about on-prem versus the cloud, the first thing to think about is that your finance team is going to want to better understand your spend, because they’re used to a spend model where all of your resources are purchased upfront and then depreciated over time. But now, your spend model has completely shifted to a more granular model focusing on actual usage over time.

Pete: Yeah, this is an interesting one. And this is the classic OpEx versus CapEx. That’s operations expenditures versus capital expenditures. A capital expenditure is something—I usually call it something you can touch; it’s a thing. A server, that’s capital expenditure.

These are largely accounting terms and should not be considered scary things for businesses, because you’re spending the money either way; it just differs in how the money is spent on a cash basis. And we could go off forever on this one—

Jesse: Yeah.


Pete: —and I really don’t want to. But there is a difference here that defines it. I think another thing I like to think about when it comes to engineering from the data center world to the cloud world is that the way in which you operate will be charged differently, just by nature. And again, I know we harp on data transfer so much here, but it’s because it’s the last thing people think about: in a data center world, you may not even think about your data transfer; it’s just there, right? Some very advanced networking engineers set up this network for you; you just use it.

You’re probably not even charged for it or metered on it. That model breaks down very quickly. So, if you had an application and you were pushing uncompressed JSON all over the place because, who cares? Why would I spend CPU cycles on compression? I don’t need to do that; I have unlimited networking. That model is going to show very bad things in the cloud. And you have to think about that before you go down that path.
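
To put numbers on the compression point, here is a tiny self-contained sketch; the payload is invented, but the ratio is typical of repetitive JSON:

```python
# Sketch: what "uncompressed JSON all over the place" costs on the wire. Every
# byte crossing an AZ or region boundary is metered in the cloud, and gzip
# routinely shrinks repetitive JSON dramatically.
import gzip
import json

payload = json.dumps(
    [{"event": "page_view", "user_id": i, "path": "/home"} for i in range(1000)]
).encode("utf-8")

compressed = gzip.compress(payload)
print(f"raw: {len(payload)} bytes, gzipped: {len(compressed)} bytes "
      f"({100 * len(compressed) / len(payload):.1f}% of original)")
```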

Jesse: Yeah, this really gets at the idea of total cost of ownership for these resources. Don’t just think about, “I’m buying these servers to run my application.” You need to think about also the data transfer associated with those servers, for example. You need to think about the engineering time required to manage those services. Maybe your company has decided you’re going to move to a Kubernetes cluster; you’re going to put Kubernetes clusters in each of the cloud environments that you spin up with the different cloud providers so that your developers can focus on just building containerized application workloads and just deploy them wherever. It doesn’t really matter because there’s just Kubernetes everywhere.

Pete: Exactly. It’s—I think that concept of, “Oh, it doesn’t matter. We just have Kubernetes everywhere,” leads into the next thought, which is how are you deploying to your data center assets? But then also, how are you deploying to cloud A versus cloud B? If you adopt some of the cloud-native solutions, those don’t translate really well between providers, even ones you would kind of expect to, right?

Like EKS on Amazon—their Kubernetes service—doesn’t have a direct translation to Google’s container offering—GKE—or Microsoft’s—is it Microsoft DevOps, or whatever they call their Kubernetes release? [laugh]. Whatever stupid name they gave it. But that’s a big point: even though you’re using Kubernetes across all these cloud vendors, the way that you interact with them is going to be wildly different, beyond just the authentication side of it, in the features and the APIs that you interact with.

Jesse: Yeah, the last topic that I want to touch on here before we move on to the second main question in this discussion is public sector in general. Pete, I know, you’ve got some thoughts and feelings on that one.

Pete: Yeah. So, there are going to be a lot of constraints the public sector has to deal with that, you know, my startup that just got funded yesterday for an always-on chat system won’t. Risk and compliance will usually apply to these public sector clients a lot more stringently than to most other companies out there; they might even have a dedicated risk and compliance team. And those compliance needs will drive a lot of architectural decisions, and thus will actually drive a lot of the cost associated with them.

So, in many cases, you’re not going to be able to change things—and we’ve actually seen this with clients of ours: we’ve worked with a lot of clients that are very stringent in their risk and compliance, and we’ll find places where they could potentially save or optimize, but due to their compliance needs, they’ve told us, “No, we can’t do that because risk needs this thing. Compliance needs that thing.” So, sometimes you just can’t change what you’re doing and how you’re doing it in the public sector, just because you’re not allowed to. And so those are places where, honestly, don’t fight that battle; just accept it, but make sure all of the stakeholders understand those requirements.

Jesse: Okay, so now let’s address the second part of this question, which is really talking about the pros and cons of an internal cloud or DevOps team versus an outsourced team. And off the bat, I’m already upset with myself because I’m not a fan of the DevOps job title. DevOps is not a job title, DevOps is a philosophy.

Pete: Yeah. I’ve gone back and forth on this one. And I feel like we could fill a whole podcast on this topic. And I’m sure people haven’t beaten this one to death. But I had DevOps as a job title because I was trying to create this concept that we’re talking about—center of excellence, or it has a lot of other terms—and I said to myself, “Well, if we want to implement DevOps within this business, then we want someone to be in charge of that who can help level up all these teams.”

And the downside is when you are the Director of DevOps, you suddenly own DevOps, and—for whatever that means; you own it and people are going to expect that everything falls on you. And that feels like a silo, right? Which is not what DevOps is all about. So, on the flip side, though, we all know that people with DevOps in their title—or SRE—you’re going to get paid, like, 20 or 40 percent more than someone with a sysadmin title, so go get paid, people. Put that in your title if it gets you paid. But from a higher up, a director or VP, don’t be a VP of DevOps; that is a dangerous job to have.


Corey: This episode is sponsored by ExtraHop. ExtraHop provides threat detection and response for the Enterprise (not the starship). On-prem security doesn’t translate well to cloud or multi-cloud environments, and that’s not even counting IoT. ExtraHop automatically discovers everything inside the perimeter, including your cloud workloads and IoT devices, detects these threats up to 35 percent faster, and helps you act immediately. Ask for a free trial of detection and response for AWS today at extrahop.com/trial.

Jesse: Yeah, so I really want to quickly highlight and define this idea of an outsourced team. Outsourced teams usually become a center of excellence. And a center of excellence is defined as a team, shared facility, or an entity that provides leadership, best practices, research, support, and training for a focus area. In this case, this outsourced managed provider would be the focus for all cloud cost management. But there’s pretty recent research that shows centers of excellence aren’t usually the best way to get work done in an engineering space.


The 2019 State of DevOps report asked its respondents how their teams and organizations spread DevOps and agile methods within their organization, and it noticed two really interesting things. First, low-performing organizations focused on strategies that created more silos and isolated expertise, which in some ways makes sense, but it means they were siloing all of that expertise and information. They created a disconnect between the people who were creating the best practices in the center of excellence and the people who were following or implementing those best practices within the individual teams, whether it’s a product team, or a specific engineering team, or something else.

Pete: Yeah, there’s a lot of research out there. That State of DevOps report, which is wonderfully researched, talks about this a lot. And just from my own personal experience, one method that I’ve taken when trying to implement cultural change and level up the technology chops of an organization has maybe had a little bit of that cloud center of excellence concept, but it’s been treated more like—ah, I don’t even know the term—like a strike team; like a task force, right? We essentially would parachute into whatever team or organization needed our help the most. So, at a company that I was at, we were trying to basically understand who was having the most pain and trying to quantify that pain in some way.

And we found, actually, that one organization was having a lot of pain in server provisioning. Like, to get a VM provisioned to meet the needs of the business, it would take days to do this. Well, sounds like a great job for automation. So, we’re like, “Yeah, let’s do that. Let’s start building out some automation.”

This was years ago, so we were using Chef—that was the style at the time—and [laugh] then we noticed that when we wanted to deploy all the Chef stuff, they didn’t have any sort of continuous integration, continuous delivery system, so we shifted into building those functions out. And essentially, we were kind of moving into these teams, teaching them best practices, how to build things, how to do it right, leveling up their expertise: a very bottom-up approach, grassroots efforts, getting people excited about these new technologies, giving them the time to learn it, and then moving on to the next team and the next challenge. And treating it as: we’re not going to go in and build this thing and then run it for you forever; we’re really going to show you the capabilities and what’s available, and kind of get you out of your funk. When you’re in an enterprise and you’re down in your silo and you’re just focused on your one thing, you might not even know all the great stuff that’s out there.


Jesse: Yeah, I think that’s a really important point to share, socialize all of the information, all of the different processes that everybody’s doing because if you’re trying to solve a problem, chances are somebody else in the organization is also trying to solve that problem, and there’s no reason you shouldn’t be working together. And this really gets at the second point of the state of DevOps report about this question, which is, high performing organizations created those community structures; they created communities of practice and grassroots initiatives in order to bring folks together to solve these problems. Emily Webber wrote a fantastic book on building successful communities of practice—we’ll link it in the [show notes 00:18:01]—but she basically talks about communities of practice as a group of people who are gathering to discuss a shared passion. You can think of it as maybe a Meetup group; you can think of it as people who care about best practices together. One example that we saw was a client who had a massive Cassandra cluster internally, and there was a dedicated team who managed the Cassandra cluster, and then a bunch of other teams that were, effectively, using the Cassandra cluster for a number of things within their workloads.


And both sets of teams—both the team managing the cluster and the teams who were using the cluster—didn’t really have strong best practices, but they wanted somebody to step up and set some best practices. And they weren’t sure if they were going to be stepping on the other team’s toes if they did it, so they came together and started having this conversation to say, “Hey, these are things that we think are important as the team that’s managing the cluster,” and the people who were leveraging the cluster said, “Hey, these are the things that we think are important as the people who are using the cluster.” And they found common ground; they compromised, they created best practices together.

Pete: Yeah. I often try to think about: why is Amazon so successful? And it’s because of their ability to do what we’re talking about for their client base, their customers out there. They are building tools that their customers can consume, and they are building best practices—the Well-Architected Framework—for how to use this stuff correctly. They put so much effort into helping the users of the service use it as best they can.

Do they do a perfect job at it? Of course not. I mean, they’re a huge place. But they do a lot more than you would expect them to need to. And that model is something that you can take and follow internally: how they create best practices, how they show you how to do it.

Amazon doesn’t go run the software you’ve deployed for you, but they will show you how to use all the tools correctly so that you can do it yourself, which is great. So, given all that, what are some things that we would recommend, Jesse, instead of saying, “Oh, go build a cloud center of excellence”? What would we actually recommend instead?

Jesse: There’s this fantastic quote that I both love and hate at the same time, which is, “If it’s everyone’s responsibility, it’s no one’s responsibility.” I always struggle with this one because I believe that cloud costs should be everybody’s responsibility, but it’s true; if everybody is responsible for it, then I can say, “Well, you’re responsible for it, too, and if you’re not doing it, then I’m not doing it,” and then nothing gets done. So, leadership needs to be accountable for cloud cost management, but they also likely need an individual or a team to champion cloud cost management, to float around—similar to Pete’s description of his experience—and create and foster that buy-in, ultimately creating that community of practice or that grassroots initiative so that everybody is on the same page together, everybody socializes together, everybody knows that they are not alone in solving these problems, and can really solve those problems together.

Dr. Nicole Forsgren was on Screaming in the Cloud recently, and she has some great things to say about this. She mentioned that the types of solutions that focus on building up communities of practice, building up grassroots efforts, and building up proofs of concept will be resilient to reorgs and product changes. Those will last over time and help your organization create that lasting change that you ultimately want.

Pete: Yeah, that is a huge point: you want to build these grassroots efforts that will survive the inevitable reorganizations and product changes that are going to happen in your business. I mean, I’ve been at companies that would reorg every six months. There’s no way a center of excellence would have survived those reorgs; those engineers, those people, would have been retasked elsewhere. But if instead you’re doing it grassroots, and you’re leveling up, and the whole rising tide lifts all boats, that can survive reorgs—in many cases it can thrive, because you might end up with these teams breaking up, reorganizing, and spreading more of this goodwill and knowledge across the company, and that has a real chance at being a force multiplier in that business.

Wow. Well, as you can imagine, this was a great question that Jesse and I both had a lot of feels on, and to the public sector coward, thank you so much for going to lastweekinaws.com/QA and sending us this question. This is an important one and we were really happy to talk about it. So, just a reminder, you can head to that same website, send us your questions; we’re going to keep answering them as we discuss the Unconventional Guide to AWS Cost Savings.

In the meantime, though, if you have enjoyed this podcast, please go to lastweekinaws.com/review, give us a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review and give it a five-star rating on your podcast platform of choice, and then let me know how stupid I was to give myself the title of Director of DevOps. Thank you.

Fri, 12 Mar 2021 03:00:00 -0800
Corey Quinn’s AWS Beta Certification Exam Report

Want to give your ears a break and read this as an article? You’re looking for this link.



Help the show



What's Corey up to?

Wed, 10 Mar 2021 03:00:00 -0800
Flow Logs, She Wrote
AWS Morning Brief for the week of March 8, 2021 with Corey Quinn.
Mon, 08 Mar 2021 03:00:00 -0800
Tag—You’re It!

Links:


Transcript

Corey: This episode is sponsored in part by LaunchDarkly. Take a look at what it takes to get your code into production. I’m going to just guess that it’s awful because it’s always awful. No one loves their deployment process. What if launching new features didn’t require you to do a full-on code and possibly infrastructure deploy? What if you could test on a small subset of users and then roll it back immediately if results aren’t what you expect? LaunchDarkly does exactly this. To learn more, visit launchdarkly.com and tell them Corey sent you, and watch for the wince.



Pete: Hello, and welcome to the AWS Morning Brief: Fridays From the Field. I'm Pete Cheslock.

Jesse: I'm Jesse DeRose.

Pete: And we're back again, Jesse. We are back. But really, have we gone anywhere to begin with?


Jesse: We've been making our way slowly but surely through this Unconventional Guide. Lots of really interesting recommendations, lots of really interesting feedback from all of you, which we really, really appreciate. We can't wait to dive into some of those ideas deeper in future episodes.


Pete: Yeah. And don't forget, you can give us additional feedback and questions at lastweekinaws.com/QA, feel free to add your name. Or not. Doesn't matter. It can be totally anonymous. That's fine with us. So today, we're talking about a topic that is very near and dear to our hearts.

Jesse: Yes.


Pete: It is tagging.


Jesse: Yes.

Pete: Tagging your resources in Amazon—or, I mean, really any cloud provider; anyplace you can tag something, you probably should. And we're going to talk a little bit about strategies for that, how people use their tags, just all the fun things related to it. Tagging: it's easy to do, right, Jesse? You just tag your resources and all your problems go away.

Jesse: Yep. Thanks, everybody, have a good night.

Pete: So yeah, if you've enjoyed this podcast, please go to—no, I’m just kidding.


Jesse: [laugh].

Pete: Tagging is probably the thing that most companies are doing poorly, simply because it's hard, and it's an afterthought, and if you didn't have a really solid strategy from the start to ensure tags and enforce compliance, you're probably not going back to fix it.

Jesse: Yeah. It's not thought about as something that's a first-class citizen in the cloud world. When you think about the things that are important to your business model, you might think about getting your application out the door and running, maybe talking about business requirements for availability, failover, data retention, but tagging is nowhere on that list. That's not something that I think any organization thinks about as part of an MVP, let alone future iterations of their products.

Pete: Tagging feels much like the same feeling I get when my doctor says that I should eat more veggies.

Jesse: Oof.

Pete: I know they're good for me; I know we need to do this. They have vitamins, and fiber, and all these wonderful things. But in order to make those veggies something I want to eat, we have to learn to make it more delicious. Personally, I find duck fat works to make them more delicious. I wish we could apply a duck fat strategy to the tagging problem.

Jesse: Yeah, it's not an easy problem to solve. Or rather, I should say it is an easy problem to solve, but it's not something that anybody is quickly incentivized to solve. Tagging, just for the sake of tagging, it doesn't work.

Pete: Yeah, it's that there really are no incentives for it. No good incentives. It's usually because someone came over to your desk and said, “Hey, what's this charge for? And who's using it? And what's the deal with this?”

And you're going into Cost Explorer, and you're like, “Uh, I don't know. It's in this one account.” And that's as far as you can go to figure out who did what and why that thing is the way it is.

Jesse: Yeah. There are so many different tagging strategies that we've seen. We've seen some clients talk about tagging as a way to potentially penalize engineers who aren't tagging or who are spending too much money. We've seen organizations who are tagging to reward teams that are tagging all their spend or keeping their spend optimized. Across the board, there are just so many different ways to go about this.

Pete: So let's assume you are like most of the companies that we've seen. Definitely not all: there are some rare gems out there that are making tagging a long term and continual process, which we're actually going to talk about in a future episode, how to do that. But let's say you're just looking at your bill, you're looking at your usage, and you're saying to yourself, “Okay. I need to be better at this.” What do they say, “The journey of a thousand miles starts with a single step?” What is that first step?

Jesse: Yeah, there's a lot of different ways to go about this. I think there's a couple great places to start. Now, I will say AWS has a thrilling 24-page best practices white paper that we’ll throw a link to in the [show notes 00:05:18].

Pete: Have you read that, Jesse?

Jesse: I will say that I have read parts of it. I have not read all of it, and so I want to make it very, very clear to all of our listeners, this is not a document that needs to become the holy grail for your organization. I think in the same way that you could read the SRE book from Google and have some good takeaways, you can skim through this white paper, maybe read through a couple of the sections that seem most applicable to your organization, and then start with those ideas, start with those best practices, and then build them over time organically; develop them over time organically.

Pete: I like to read it some nights when I'm just having trouble sleeping, and maybe by page two or three I’m just out.

Jesse: Yeah. There's a lot of content in there talking about what to tag, why to tag. I think the best place for any organization to start is to think about what are the important things that we need to tag. And that's a conversation that's going to involve not just engineers, but also finance, potentially IT, maybe also security teams, depending on how your organization is built. Because ultimately, what you want to do is understand what are the things that my organization cares about when it comes to our cloud usage?

Maybe engineers care about which teams are using which services or they care about who owns which services. So, for example, when there's that i3 instance that somebody spun up and forgot to spin down, and everyone goes, “Well, it's not me, clearly you did it.” And then you realize that it's tagged with an owner tag that says that I did it, then you can't really argue with that. But then also think about maybe finance wants to know, what is the accounting unit for each of these resources? Is it usage for cost of goods sold, or COGS for example; something maybe in a production environment?

Is it in a development environment and associated with research and development, for example? Is it something that the entire organization needs to use? Like sort of a general or accounting section. And then ultimately they can use that information to break down spend from a financial perspective, for forecasting purposes, for business finance purposes, to really help better understand how the organization overall is using the Cloud.

Corey: This episode is sponsored by ExtraHop. ExtraHop provides threat detection and response for the Enterprise (not the starship). On-prem security doesn’t translate well to cloud or multi-cloud environments, and that’s not even counting IoT. ExtraHop automatically discovers everything inside the perimeter, including your cloud workloads and IoT devices, detects these threats up to 35 percent faster, and helps you act immediately. Ask for a free trial of detection and response for AWS today at extrahop.com/trial.

Pete: Yeah, tagging can really go beyond and capture a lot of information. You can store—what is it—50 key-value pairs on most Amazon resources. Now, there are limitations, of course. The most effective way to get 100 percent coverage and allocation on your spend is to use an account per business unit or product. That is obviously pretty complex and can be challenging to do, but tagging can get you into the 90 percent range of coverage.

And those tags—I think to Jesse's point, what you mentioned was, by just tagging for the sake of tagging, don't waste your time. If you're looking at all these resources and you're clicking around the Amazon tooling that can help you tag and categorize, unless you have a plan for what you want to answer with those tags, don't even waste your time. But having those conversations with finance, having those conversations with your security teams—I've seen interesting use cases, not for an access control reason, but as a way of just tracking these resources from a security perspective. One of my favorite uses of tagging was at a previous company where we ran multiple accounts per environment. So, this was kind of earlier in Amazon, when they didn't have a lot of tooling for running multiple accounts.

You wouldn't want to run 100 accounts on Amazon at this time; that was just way too much. The tooling just wasn't advanced yet. But we ran maybe four accounts. We ran our production, maybe our development, our QA, you know, a security account. Really, really basic.

But every once in a while, we had a need to run systems in our production environment that were kind of like test systems. They were not load testing, but sometimes we’d want to analyze large amounts of data or test out new versions of software—but the software running on all those servers isn't directly tied to our cost of goods sold. So, when looking at your Amazon bill for production, it's going to look inflated. And your CFO is going to come over and say, “Why is the bill so high? Why did it grow so much?”

And so what we did was use tagging to basically identify which resources were running in production that we could subtract from our cost of goods sold, and that actually allowed us, from a financial standpoint, to improve our gross margin numbers and make them as accurate as possible.
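
As a sketch of how that subtraction works in practice: once a cost-allocation tag is activated in the billing console, Cost Explorer can group spend by it. The tag key and dates below are hypothetical:

```python
# Sketch: split a month of spend by a hypothetical "cogs" cost-allocation tag,
# so test systems running in production can be subtracted from cost of goods
# sold. Assumes the tag has been activated for cost allocation.
import boto3

ce = boto3.client("ce")
resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2021-02-01", "End": "2021-03-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "cogs"}],
)

for group in resp["ResultsByTime"][0]["Groups"]:
    tag_value = group["Keys"][0]  # e.g. "cogs$true", or empty for untagged
    cost = group["Metrics"]["UnblendedCost"]["Amount"]
    print(tag_value, cost)
```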

Jesse: So, let's say you've talked to engineers, you've talked to finance, you've talked to product, security, IT, whoever else, and you've built this great tagging strategy, this great tagging policy that you want to now enforce across the engineering organization. There are lots of different ways to enforce it. And again, a lot of engineers aren't incentivized to add tags to their resources, so in order to make sure that all of your taggable resources are tagged, there's a couple different things to think about. I think the number one thing is: how can you automatically add tags to your resources through your automation, through systems like your CI/CD deployment pipeline? Can you automatically add all of the tags related to your tagging policies in the CI/CD pipeline that creates these resources? Or maybe through infrastructure as code that automatically has these tags set?
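
And on the enforcement side, here is one hedged sketch of the cleanup half of that job, using the Resource Groups Tagging API to find resources missing an “owner” tag and backfill a placeholder. The tag keys are hypothetical; your own policy defines its own:

```python
# Sketch: sweep for resources missing an "owner" tag and backfill a
# placeholder value so they at least show up in reports for triage.
import boto3

tagging = boto3.client("resourcegroupstaggingapi")

untagged = []
for page in tagging.get_paginator("get_resources").paginate():
    for resource in page["ResourceTagMappingList"]:
        keys = {t["Key"] for t in resource.get("Tags", [])}
        if "owner" not in keys:
            untagged.append(resource["ResourceARN"])

# TagResources accepts at most 20 ARNs per call.
for i in range(0, len(untagged), 20):
    tagging.tag_resources(
        ResourceARNList=untagged[i:i + 20],
        Tags={"owner": "unknown-needs-triage"},
    )
```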

Pete: Yeah, the key point is that no matter what you start with, it's going to change. It's going to change next year or the year after. That one great strategy you created is going to live as long as it's going to live, but it's definitely not going to live forever. So, just get started—to your point, Jesse, integrating into your CI/CD systems, your Terraform, your CloudFormation, however you provision these assets, is a great way of doing that. And just make it the default for net-new services. You can go back and fix other things later, but just make it the default. But I think one other great thing is: when you do come up with this great strategy, don't keep it a secret.

Jesse: Yes, absolutely. We'll talk about this one a little bit later in another episode, but please, please share it with everybody.

Pete: Yeah, exactly. You spent all this time pulling together all these different groups to come up with this strategy. Now is the time to, again, pull together all these groups and let them know about it. So, there are these different ways of getting started, and Amazon tooling has gotten so much better at helping people identify their untagged resources and tag them. These tools are just so much better than they were before that I feel like there's less of an excuse now for tag coverage to be a very low percentage.

It's just so well integrated into CloudFormation, you know, all these things. It's all there. It's all at your disposal. And if you think it's going to be a waste of time, Jesse and I can hopefully calm your fears: any investment you make into tagging—as long as it's well planned out with the various teams—will pay a massive amount of dividends in the future. You don't realize it yet.

If you're a startup, and you're two people, and you're tagging your items from day one, and it's like, “Ugh, it’s such a pain.” In two or three years from now, you're going to look back and be like, “I'm so glad we did that.” And hilariously, Jesse, we run a lot of applications, right—

Jesse: Yeah.

Pete: —in our own accounts. How do we do with tagging?

Jesse: Yeah, I was just about to say, we ran into this exact issue within our organization because we started spinning up resources without a clear tagging policy, and then all of a sudden, when our bill reached a point where Corey said we needed to know where these costs were going, we didn’t. We didn't have a tagging policy in place. So, we are a clear example of what not to do. And we realized, this is important. And we've seen this in other organizations as well. It is never ever too early to start tagging your resources in AWS.

Pete: The cobbler's children have no shoes, basically.

Jesse: Yeah.

Pete: It's so true. And what's funny is that, largely, we only started to care when the credits ran out. Which is incredibly common. No one cares what you're spending when you're a startup and you've got your two years of credits, or whichever program that you followed to get those free credits. You got your credits from setting up your company on Stripe, and you got your Amazon startup credits, all this other stuff. Yeah, go nuts. Just provision things. There's no bill. And then the month that those credits run out, and you're like, “Huh. Shit.” [laugh].

Jesse: Yeah, I didn't realize that we were, uh… spending that much money over there.

Pete: Exactly. So, we can definitely say we've experienced it at our previous companies—places where we've had to adopt a tagging strategy later, or where tagging wasn't well supported until later on and then you have to add it—it is so much harder. So, the earlier you can start, the better. But don't fret if you're later in your company's lifecycle and you haven't gotten anywhere yet; just get started today. Start having those conversations. That's the first step: to start having those conversations.

Jesse: Absolutely. And to Pete's point earlier, this tagging policy is going to change over time, so make it a point to reassess this information, maybe on a quarterly basis, maybe on an annual basis, to course-correct over time because these tags are going to change over time. The needs of the organization are going to change over time. And that's fine. That's absolutely valid. So, make sure that you just put that on the calendar now so you can have those conversations.

Pete: Yeah, and we will be back next week to talk more about this and, once you've started tagging, how to improve and continually improve upon that and what strategies you can follow.

So, if you have enjoyed this podcast, please go to lastweekinaws.com/review, give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review and give it a five-star rating on your podcast platform of choice and tell us how well you're doing at tagging. Again, I always like to remind folks, you can go to lastweekinaws.com/QA. Send us your questions. We would love to hear from you, and we'll be answering those in future episodes. Thanks again.

Announcer: This has been a HumblePod production. Stay humble.

Fri, 05 Mar 2021 03:00:00 -0800
Two Views of Lambda Diverged in a Yellow Wood

Want to give your ears a break and read this as an article? You’re looking for this link.



Never miss an episode



Help the show



What's Corey up to?

Wed, 03 Mar 2021 03:00:00 -0800
Firewall Transit Gateway Dingus
AWS Morning Brief for the week of March 1, 2021 with Corey Quinn.
Mon, 01 Mar 2021 03:00:00 -0800
Humans Are the Most Expensive Part of Cloud

Links:

Transcript


Corey: This episode is sponsored in part by LaunchDarkly. Take a look at what it takes to get your code into production. I’m going to just guess that it’s awful because it’s always awful. No one loves their deployment process. What if launching new features didn’t require you to do a full-on code and possibly infrastructure deploy? What if you could test on a small subset of users and then roll it back immediately if results aren’t what you expect? LaunchDarkly does exactly this. To learn more, visit launchdarkly.com and tell them Corey sent you, and watch for the wince.

Corey: Ever notice how security tends to be one of those things that isn’t particularly welcoming to folks who don’t already have the word ‘security’ somewhere in their job title? Introducing our fix to that, Meanwhile in Security. To sign up for the newsletter or to find the podcast, visit meanwhileinsecurity.com. Coming soon from The Duckbill Group.


Pete: Hello, and welcome to Fridays From the Field. I'm Pete Cheslock.


Jesse: I'm Jesse DeRose.

Pete: And we're back, again. We're continuing our series, the Unconventional Guide to AWS Cost Management. And as always, if you have questions, as we are going through this series and want to learn more, go to lastweekinaws.com/QA. Thank you to all of those who have already submitted questions.

Jesse: Yes.

Pete: Really great ones coming in.


Jesse: Thank you.


Pete: We're going to take a couple of episodes in the future to answer those questions and really dive into them. So, keep them coming. We really love them so far. So Jesse, what are we talking about today?


Jesse: Today, we're going to be talking about one of my favorite topics, which is that humans are the most expensive part of Cloud.


Pete: Yeah, we hear this quite a bit. I mean, not just in salary, right? This is the line that usually is mentioned when we talk to folks about their Amazon spend. They say, “Well, outside of salary, Amazon is our most expensive bill.”

Jesse: Yeah.


Pete: That line has been repeated more times than I can count.


Jesse: But what's so fascinating to me is that this really gets at the idea of total cost of ownership. I think that's ultimately what I really want to focus on for just a second. Total cost of ownership is thinking about all of the spend related to your cloud costs. Now, when you think about cloud costs, you will generally think about just the usage that you have within AWS, maybe some discounts from either an EDP or PPAs. But are you thinking about how much time it's taking your engineers to manage all of that usage, manage that infrastructure, manage the deployment pipelines that are living within the cloud? Are you thinking about all of those components and the cost of those components alongside your usage?

Pete: Yeah, exactly. I think engineers are bad at this.

Jesse: Yeah.

Pete: Myself included. But this is something where we want to build things. That's why we're in this industry. And it's fun to build things. Maybe not so much fun to manage those things on an ongoing basis. Looking at you, Cassandra and Elasticsearch clusters.


Jesse: [laugh]. Yeah, it's this idea that there are definitely opportunities for engineers to spin things up and manage things on their own when you want to build that Kubernetes cluster and learn how to manage a Kubernetes cluster, learn how to build a Kubernetes cluster. That's great. We don't want to stop you from building and learning at all. But when you're building infrastructure for your organization, for your teams, for your products, is it going to be more cost-effective for you to build this solution yourself, or is it going to be more cost-effective for you to leverage existing managed services within the cloud?

Pete: I like to call it operational FOMO, you know, the fear of missing out. And I think a lot of engineers suffer from that when it comes to the new hotness, the new stuff. Kubernetes is a great example. I mean, I feel like a lot of those people were also equally like, “OpenStack is going to be the best thing ever.” And then it wasn't.


But I like to think of my time at a previous company where we deployed into the cloud, specifically Amazon, and there was a fear—again, we've mentioned this before, it's an irrational fear—about vendor lock-in. And that fear forced us into only using core primitives: S3, EC2, EBS, really. We really didn't use much more than that. I mean, obviously, the networks and stuff go in there. And the idea was that, oh, well, we have this portability.


And we—Duckbill Group, Corey, we've all talked about it, written about this. It's a fallacy. You're locked in for a lot of other reasons that I'm not going to go into right now. But because of that, we became very good at running our own databases and specifically consuming a large amount of time-series data. It was a security event application.

And so one of the interesting flip sides of this outcome is that we ran our own monitoring infrastructure. I didn't pay for Datadog. They called me every single day and I was like, “My metrics infrastructure cost me $1,000 a month. You're going to charge me $50,000 a month. Even if you discounted that by half, I still am going to pay a lot more.”

And the reality is that we became so good at managing these systems, we didn't need those services. But I always think back: at what cost? How much more time could we have invested in the application, the product, how we deployed it, availability, all that stuff, if we hadn't had to invest so much time into running our own Elasticsearch, our own Mongo, our own Redis, our own Cassandra? We spent a lot of time doing those things.


Jesse: Yeah, there's a lot of opportunities to leverage managed solutions for those things. Because, again, part of it is this idea that your engineers don't have to spend time managing this infrastructure; they can spend time on other things. But also think about: what are the other cost components of this architecture that you may be able to avoid by using a native or a managed AWS service? For example, if you look at Amazon Elasticsearch—is it ‘Amazon Elasticsearch?’ Is it—


Pete: I always forget if it's ‘Amazon Elasticsearch’ or ‘AWS Elasticsearch.’ And oftentimes, it doesn't feel like a rhyme or reason why they name it the way they do.


Jesse: Well, let me put it this way. If you look at the managed Elasticsearch service on AWS, you don't end up paying for some of the things that you might pay for if you were managing that infrastructure yourself, like data transfer, for example, like this infrastructure management that we talked about. So, there are other reasons why you might want to leverage native services. And again, it gets back to this idea of total cost of ownership, how much is it actually costing you to run these things on the AWS primitives, for example? How much are you actually spending among not just the compute usage or storage, but on data transfer, on the engineers who are spending time managing this infrastructure? What kind of other things could you be working on instead, during that time?
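
A back-of-the-envelope way to frame that comparison is below. All the numbers are invented; the point is simply that engineer hours belong in the math next to the line items on the bill:

```python
# Sketch: naive monthly total cost of ownership. Made-up numbers; the
# takeaway is that the "cheaper" self-managed option often isn't once
# engineer time is priced in.
def monthly_tco(infra_cost, data_transfer_cost, engineer_hours,
                loaded_hourly_rate=120):
    return infra_cost + data_transfer_cost + engineer_hours * loaded_hourly_rate

self_managed = monthly_tco(infra_cost=4_000, data_transfer_cost=2_500,
                           engineer_hours=60)  # you run the cluster
managed = monthly_tco(infra_cost=9_000, data_transfer_cost=0,
                      engineer_hours=5)        # AWS runs the cluster

print(f"self-managed: ${self_managed:,}/mo, managed: ${managed:,}/mo")
# self-managed: $13,700/mo, managed: $9,600/mo
```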


Corey: This episode is sponsored in part by CircleCI. CircleCI is the leading platform for software innovation at scale. With intelligent automation and delivery tools, more than 25,000 engineering organizations worldwide—including most of the ones that you’ve heard of—are using CircleCI to radically reduce the time from idea to execution to—if you were Google—deprecating the entire product. Check out CircleCI and stop trying to build these things yourself from scratch, when people are solving this problem better than you are internally. I promise. To learn more, visit circleci.com.


Pete: Yeah, that's a really great point about the cost of some of the managed services—specifically that the replication data transfer of Elasticsearch is going to be included. That is a thing in other services as well; RDS is another good example. And that is a big component of a lot of folks’ Amazon bills. I mean, we see a lot of Amazon bills. And I know I've said this before, but I can tell if you're running Elasticsearch or Cassandra without you telling me.

I can just see it in your network data transfer. Conversely, I was actually shocked recently. I looked at one of our client’s bills and their usage and saw a disturbingly low amount of data transfer to the point that I was a little worried. Do they have any, like, availability requirements? Why are we not seeing a large amount of cross-AZ data transfer?

And it turns out, they were leveraging really heavily a lot of the Amazon managed services, where some might say it's free and some might say it's baked into the cost—but you have to think about that. You might look at the Amazon managed Elasticsearch offering and say, “Wow, this is really expensive. It's a lot more; I can just run it myself.” But you have to add in all of those things. And to Jesse's point, too, if I don't have to manage setup and deal with all of the intricacies of a distributed database, and I can just outsource that, then I can go on and maybe improve some other part of my infrastructure that is waking me up in the middle of the night.
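
That bill-reading trick is reproducible, by the way. Group spend by usage type and look for the regional (cross-AZ) data transfer line items; this sketch assumes the usual “DataTransfer-Regional-Bytes” usage-type naming:

```python
# Sketch: surface cross-AZ data transfer spend, the telltale signature of a
# self-managed Cassandra or Elasticsearch cluster doing replication.
import boto3

ce = boto3.client("ce")
resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2021-02-01", "End": "2021-03-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
)

for group in resp["ResultsByTime"][0]["Groups"]:
    usage_type = group["Keys"][0]
    if "DataTransfer-Regional-Bytes" in usage_type:  # cross-AZ traffic
        print(usage_type, group["Metrics"]["UnblendedCost"]["Amount"])
```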


Jesse: Yeah, I think another thing to think about in this context is not just how expensive it is for engineers to manage some of this infrastructure, but what kind of business risks you're looking at by asking your engineers to spend time managing this infrastructure rather than allowing AWS to manage it natively. Specifically, there's a client that we worked with who ran a bare-bones Kubernetes cluster on EC2 instances, and they had this amazing, mature model for cost management and cost attribution on that Kubernetes cluster. But all of this knowledge ran through one person, and that led to a potential business risk. It wasn't just a matter of it being expensive for this person to be doing all of this work managing all this infrastructure; it was also a business risk for the business to rely on this single individual to have all of this knowledge.

If this person left the company, for example, nobody would have any idea how to manage this infrastructure, or how to attribute costs in this infrastructure or gather the financial data they needed month-over-month to attribute costs back to different teams and to review other metrics.


Pete: Yeah, I think a lot of folks, too, maybe they feel like they're giving up a sense of control? Or maybe it's a real fear, maybe it's not. I don't know. But the services that exist now on Amazon for even running other things, like I'm always a little shocked to see folks who are starting on the Cloud right now start on EC2, specifically outside of the lift and shift model. If you're lifting and shifting, yeah, yeah, you're moving to EC2. That's obvious. But if you're a brand new company, just going on to the Cloud, EC2 should be probably the last service that you're setting up.


Jesse: Yeah.

Pete: You got Fargate, EKS, ECS, there's so many ways to run containers. And that's just easy. It's just easy to do. And it's a great way to get started. But I even look at things like the databases, as well, that allow you the ability to get started really easily and really quickly with maybe, like, a T class RDS instance.


You can change the instance size later; as your scale grows, you can increase the disk later, as it grows. That's a really interesting way to get started at a really, really low cost. You can always add more later, versus, again, the classic data center world of buying a bunch of really big servers and hoping that your infrastructure was going to grow. It's like the old world of online gaming: video game companies would buy all these servers for launch day, and they still wouldn't have enough. And then over time, the usage of that game would go down and down and down, and they were left with all these servers. So, being able to start small and grow is a great way to just see how people actually use your application.
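
And the “start small, grow later” move is a single API call when the time comes. A hypothetical sketch; the identifiers and sizes are made up:

```python
# Sketch: grow an RDS instance once real usage justifies it. Storage can only
# be increased, never decreased, which is another argument for starting small.
import boto3

rds = boto3.client("rds")
rds.modify_db_instance(
    DBInstanceIdentifier="app-primary",   # hypothetical instance name
    DBInstanceClass="db.r5.large",        # up from, say, a db.t3.medium
    AllocatedStorage=200,                 # GiB
    ApplyImmediately=False,               # wait for the maintenance window
)
```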

Jesse: Absolutely.

Pete: Yeah, at the end of the day, I think what we're really getting at here is that, more broadly, folks should really fear less [laugh] about the managed services. Whether it's a managed service on Amazon or you're using Datadog, this concept of vendor lock-in as a reason not to use the easiest service is just a really sad state of affairs; it's sad to hear so many people still say this. Take a place like MongoDB and their Atlas service: they are the creators of Mongo, so theoretically they're the best place to get that service from. Are you locked into them? Well, yes. But your business is already locked into Mongo because some engineer provisioned it in 2014 as a side project and you're still running it. You're locked in by all of these decisions you make. You might as well go and use the service that is the easiest to use.


All right, well, if you've enjoyed this podcast, please go to lastweekinaws.com/review and give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review and give it a five-star rating and tell Jesse why you loved it so much. I mean—

Jesse: [laugh].


Pete: —hated it so much. Also, do not forget, we are still taking questions. We do want to hear your feedback. Send us a question, you can add your name or not, to lastweekinaws.com/QA and we'll answer those in a future episode. Thanks again.


Announcer: This has been a HumblePod production. Stay humble.

Fri, 26 Feb 2021 03:00:00 -0800
Setting the Record Straight on the 'Very Funny Cloud Computing Billing Expert'

Want to give your ears a break and read this as an article? You’re looking for this link.


Never miss an episode



Help the show



What's Corey up to?

Wed, 24 Feb 2021 03:00:00 -0800
The World Thinks I'm Funny, AWS Disagrees and Commits
AWS Morning Brief for the week of February 22, 2021. with Corey Quinn.
Mon, 22 Feb 2021 03:00:00 -0800
Infrastructure Code Smell (aka Who Microwaved the Fish?)

Links:


Transcript

Corey: This episode is sponsored in part by LaunchDarkly. Take a look at what it takes to get your code into production. I’m going to just guess that it’s awful because it’s always awful. No one loves their deployment process. What if launching new features didn’t require you to do a full-on code and possibly infrastructure deploy? What if you could test on a small subset of users and then roll it back immediately if results aren’t what you expect? LaunchDarkly does exactly this. To learn more, visit launchdarkly.com and tell them Corey sent you, and watch for the wince.

Pete: Hello, and welcome to the AWS Morning Brief. I’m Pete Cheslock.

Jesse: I'm Jesse DeRose.

Pete: Fridays From the Field, Jesse. We're back again.

Jesse: Back, back, back again.

Pete: I always say that when I rage-quit computers, it would be fun to be a farmer. So maybe this is a little trial run of “Fridays From the Field.” I'm just out in the field.

Jesse: So basically, what I'm hearing is that you are the old man out in the field, yelling at the clouds as they go by.

Pete: Well, now that I work from home pretty much all the time, partly as part of Duckbill but also due to COVID, I do yell at the squirrels who constantly tear up my yard. I've now turned into that person.

Jesse: [laugh]. Oh, oh, Pete, I'm so sorry.

Pete: Those squirrels. I hate them. So we're back again, talking about the Unconventional Guide to AWS Cost Savings. And this time, we're talking about ‘infrastructure code smell.’

Jesse: Ooh, fun one.

Pete: I like to equate this to: who brought the fish for lunch and microwaved it?

Jesse: I always understood that at a deep core level, but didn't really think about it until I actually did microwave fish one day, and I regret everything.

Pete: Don't do it. I'm telling you, folks, don't do it. You can bring tuna fish in. I guess that's fine. That's a little bit better. If it's packed in oil, it actually is a lot less smelly. Should we do a food podcast? No, I’m just kidding. [laugh].

Jesse: [laugh].

Pete: So, ‘code smell.’ I do want to bring this one up because I actually had a little bit of a TIL—today I learned—moment with code smell. The term was coined by Kent Beck, a writer from the agile software movement, while he was working with Martin Fowler, a noted author on programming, on the book Refactoring. That's where the phrase ‘code smell’ comes from.

Jesse: I did not know this.

Pete: Yeah. You know, you kind of hear a term and you just accept it without really understanding why. But what the book calls a code smell is a surface indication that usually corresponds to a deeper problem in the system. So obviously, it is what it sounds like: something smells; something doesn't seem good here. And it can take a lot of forms. You most often hear it in, obviously, software engineering, but guess what? Software engineering has expanded to manage our infrastructure, right?

Jesse: Mm-hm, absolutely. Yeah, infrastructure smell is not just about wasted resources. It's really about all of those one-off hacks that got you this far. So, that one time that you couldn't deploy something into production, so you just said, “You know what? I'm just going to log into the console and spin up that instance, and then call it a day, and close the change order, and be done with it so I don't have to worry about it. Maybe I'll open a ticket to see if I can figure out what happened in the deployment pipeline, but I'm not going to worry about it.” All those little things that you did along the way that probably aren't the best practices you should ultimately be following and want everybody else to be following.

Pete: Yeah, and I'm looking at you, software infrastructure manager, who is still running an m1.medium in production. That's code smell.

Jesse: Oof.

Pete: Anyway. Just don't use the m1.mediums. Let them go away. But, Jesse, you're right. It's not just those hacks and one-offs. It's kind of back to the context. It's the how. How you're doing certain things with these Amazon resources, right?

Jesse: Yeah. And I think that's a really important caveat to call out, because there is always a balance between premature optimization and waste. I struggle with this one a lot. My brain automatically thinks, “Well, if I'm going to do this, I'm going to do it the right way the first time, and I'm going to do it the streamlined, automated way the first time, so that I can have it all set up on the very first go, and set it and forget it, and be done and walk away.” But in most cases, that's not how it works.

Pete: Yeah, that is a complicated topic that I've struggled with as well. I've worked for predominantly unprofitable startups. We have a burn rate: you've only got a certain amount of money in the bank, you divide it by what your spend is, and that's when you're out of money. It doesn't necessarily mean the company's out of business, but it could mean that all that sweet equity that you have no chance of actually turning into real cash has even less of a chance of turning into real cash. So, in the startup world we often make those decisions where we try to just get it done in what we hope is the best way possible. Again, we'll regret it two or three years later, but—

Jesse: Regardless of the way you set it up the first time, we will regret it two or three years later.



Pete: It's so true. Even if you say, “I’m going to set this up in the best way possible,” things change, and scale breaks everything eventually. So, in a couple of years, you're just going to be doing things in a different way—for better or worse—than you are now. And it's kind of all for naught, in many cases.

Jesse: One of my favorites that I see is application logs that are pushed into CloudWatch because you want to be able to see all of your logs or all of your metrics in CloudWatch. But then those same logs and metrics are also being sent off to Kinesis for analysis, they're being sent to Splunk for analysis, they're being sent to Datadog, or insert other third-party vendor here for analysis. So effectively, all you're doing is putting the data into CloudWatch as a queue to go somewhere else. And CloudWatch isn't cheap. CloudWatch Logs are expensive.

Pete: Exactly. This is one of my most frustrating, painful-to-see, dare I say, anti-patterns of Amazon usage. Amazon is partly to blame on this one because they do make it so easy to get your logs into CloudWatch. It's a default option. If you turn on flow logs, you can have your flow logs go to CloudWatch. God forbid you do that, because your bill will be horrific in short order. But a lot of those services also have the ability to push to S3 as well. So, I highly recommend, unless you're using CloudWatch for log analysis, push your logs to S3. In a previous episode, we talked about the data bagel, right, Jesse?

Jesse: Oh, the data bagel. My favorite.

Pete: Push all your data into the singular location, S3. It is very cheap, in many cases free, to do so, and you avoid all of this data duplication from sending it to a bunch of different places.
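As a concrete example of keeping logs out of CloudWatch, here's a minimal boto3 sketch that points VPC flow logs straight at S3; the VPC ID and bucket ARN are hypothetical placeholders:

```python
import boto3

ec2 = boto3.client("ec2")

# Hypothetical IDs; the logs land in S3 instead of CloudWatch Logs,
# which sidesteps CloudWatch Logs ingestion charges entirely.
response = ec2.create_flow_logs(
    ResourceIds=["vpc-0123456789abcdef0"],
    ResourceType="VPC",
    TrafficType="ALL",
    LogDestinationType="s3",
    LogDestination="arn:aws:s3:::example-flow-log-bucket",
)
print(response["FlowLogIds"])
```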

Jesse: And I think it's important to note that this can happen with any product initiative. It's not just the old stuff that you spun up back in the day, where you go back to look at that one line of infrastructure code or that one m1 instance and think to yourself, “Oh, no, what idiot spun this up? I can't believe we still have this m1 instance running. Who did this?” And when you go look at the tags—which of course you put on this thing—you find out: you did it.

Pete: It's me.



Jesse: Whoops.

Pete: Just at me next time, Jesse.

Jesse: Yeah. So, it is important to think about this, not just for the old infrastructure, but also for the new infrastructure that you're going to be building. Consider the cost and usage impacts before you start building. This kind of overlaps, again, with a concept from a previous episode where we talked about context is king. When you look at your application architecture and your infrastructure diagrams, think about all of the components that you're actually going to need to run your workloads, whether that is the actual compute resources, whether that is the databases, whether that is the logging structures. All of these components are important things to think about before you deploy.

Pete: I feel like this is the astronaut meme, where there's the astronaut with a gun holding on to the other astronaut. He's just looking at Earth going, “It's all context, isn't it?”

Jesse: [laugh].

Pete: Always has been.

Jesse: Yeah.

Pete: True. It's true, though, right? I think that's a really great point. Some of the most mature organizations we work with actually bring us in to review architecture planning documents as they're building services, to better understand in advance what the cost impact of those new product initiatives would be. And this speaks a lot to lift and shift, which I know we've talked about many times in the past: you've lifted and shifted your workloads over, and now you're trying to improve upon them.

Part of that improvement should actually reduce your costs. Right? Now, not always. Sometimes you're just having a better user experience, and less downtime, and less PagerDuty paging you, but if you can also do all of those things, plus save some cash, that money could be invested in other interesting projects.

Jesse: And it's also worth thinking about standardizing these procedures, documenting them, maybe creating grassroots efforts or communities of practice around these procedures, ideas, and norms. Because if you are running into these issues, it's likely that somebody else in the company is running into them too, especially if you work at a large enterprise. And now you have the opportunity to bring multiple minds together to brainstorm and build these best practices together, to build off of one another, and to really help each other move forward together.

Corey: This episode is sponsored in part by CircleCI. CircleCI is the leading platform for software innovation at scale. With intelligent automation and delivery tools, more than 25,000 engineering organizations worldwide—including most of the ones that you’ve heard of—are using CircleCI to radically reduce the time from idea to execution to—if you were Google—deprecating the entire product. Check out CircleCI and stop trying to build these things yourself from scratch, when people are solving this problem better than you are internally. I promise. To learn more, visit circleci.com.

Pete: If you're not sure how much something is going to cost, either, it's a great opportunity to run some proof of concept workloads to determine it.

Jesse: Absolutely.

Pete: We were working with one client who was going to be doing a large batch processing job, and they wanted it to be as cheap as possible. Obviously, you want to use spot instances for things like that, things that can handle interruption, but even then, they really had no way of gauging the cost. What was the cost to the business? The business has this thing they want to do, this large batch processing, and they're saying, “Well, what is it going to cost us? We want to invest a certain amount of money in doing this, but if it's too much, maybe we don't want to do this.”

And so what this client did was run a series of these batch processes, continuously optimizing the code, until the code was as optimized as they believed they could get it in the time available. Then they moved on to optimizing the infrastructure side: right-sizing instances, extending spot usage, all of those different things that could give them the closest possible estimate for a defined processing window. They could then use that to forecast: this is the approximate cost, plus or minus some flexible difference in spend, and they could have confidence that their executive leadership was making the right decision. So, Jesse, what are some helpful tips we've seen that folks can go apply right now to improve some of the smellier bits of their infrastructure?

Jesse: There are so many. To start off, think about your log data retention and snapshot lifecycles. Think about how long you actually need to keep your log data, your database snapshots, your EBS volume snapshots. And again, this may be a conversation with legal or with IT to understand those requirements: do we need to keep this data for some period of time for legal purposes? And then build your snapshot lifecycles accordingly.

I remember there was one client we worked with who had really large, I think it was CloudWatch spend—or really large VPC spend, I forget which. And when I dug in a little further, I realized they had VPC flow logs enabled for one of their VPCs, which you should absolutely do if you want to investigate, for a period of time, the data that's flowing through that VPC. But, one, they never turned it off. And, two, they never set a data lifecycle policy. So, that data was just going up, and up, and up, growing larger and larger on their AWS bill at the same time.

So, a really quick way to make sure that you're not just storing data unnecessarily: look at those lifecycle policies and see if you actually need to retain all this data as long as you've been keeping it. And if not, you can start getting rid of it earlier. Lifecycle policies are fantastic because you can set them and forget them.
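For illustration, a set-and-forget lifecycle policy might look something like this in boto3; the bucket name, prefix, and retention numbers are hypothetical and should come from that conversation with legal:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and schedule: age logs into cheaper tiers,
# then delete them once any compliance window has passed.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-log-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "age-out-logs",
                "Status": "Enabled",
                "Filter": {"Prefix": "logs/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```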

Pete: Exactly. The Amazon services allow for you to have those lifecycle policies; you don't really have to think about it. But yeah, Jesse, to your point, be sure you talk with some additional folks before you start deleting data, as you could violate some of your SOC 2-related compliance needs if you are not holding the right amount of data. Something else that I think is often missed by the clients we speak with is Compute Optimizer.

Compute Optimizer analyzes your EC2 usage, including CPU. Now, you'll need the CloudWatch agent to get memory recommendations, so its value might be a little limited if your workload is memory-heavy but CPU-light; I guess in that case, you probably want T-class instances anyway. But it does include, now—this is a recent thing—EBS recommendations as well. And EBS is probably one of the places with the greatest cost gains for a business. I mean, if you're running a lot of EBS, odds are you're running a lot of gp2, but guess what?

The workload that you're running is probably better off on a different volume type. Even Amazon says gp2 is general purpose: it's where you start. Then you analyze the workload and move it to a more appropriate volume: sc1, st1. These are significantly cheaper volumes that can still meet your I/O needs. And it's awesome that Compute Optimizer can now include those EBS recommendations. The beauty of those EBS recommendations? These volumes can be modified on the fly. You go right into the UI: click box, save money.
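The API version of that click-box is a one-liner; here's a hedged boto3 sketch with a hypothetical volume ID:

```python
import boto3

ec2 = boto3.client("ec2")

# Hypothetical volume; gp2 -> st1 happens in place, no detach required.
ec2.modify_volume(VolumeId="vol-0123456789abcdef0", VolumeType="st1")

# Progress is visible while the modification runs in the background.
mods = ec2.describe_volumes_modifications(VolumeIds=["vol-0123456789abcdef0"])
print(mods["VolumesModifications"][0]["ModificationState"])
```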

Jesse: It's amazing.

Pete: And again, any of those times that there's a click box, save money, that is just a great feeling. And the fact that Amazon can analyze for you and make these recommendations, now you have this confidence. And guess what, if you screw up and you accidentally move something to a sc1 volume, and it's not performing as well, you can change it again. I think they're modifiable up to once every six hours. So, definitely check that out. I think that is a big win for a lot of folks.

Jesse: I think it's also worth noting, you talked about EBS tiers, it's also worth noting that S3 has tiers as well. And we've talked about this in previous episodes, but stop using S3 standard storage for all of your data storage. There are definitely use cases for S3 standard storage, but there are also lots of use cases for the other S3 storage tiers as well, especially infrequent access, or possibly intelligent tiering. There's great use cases for the archive tiers that were released recently, and then also for Glacier as well for maybe some of that data retention that we talked about earlier. So, move that data to the appropriate tier so that you're not spending as much money on it as if it all lives in the standard tier. The standard storage tier is great; it's essentially the general-purpose tier of S3, but there are other tiers that you can leverage as well.

Pete: Yeah, that's a great point. I love the intelligent tiering. For most workloads we see, that should be essentially the default storage location because it's getting rare that you can have a passive application on Amazon automatically save money for you. And that's what intelligent tiering will do. Now granted, if you are running lots of small files, then the monitoring costs of intelligent tiering could actually be prohibitive, so keep that in mind.

But it has the ability for you to set these archive tiers, Jesse, like you said before, and you can configure them based on specific timing. So, maybe you eventually do want things to go to Glacier but not for six months instead of 120 days. You can modify and adjust that. And then it will make these tiering decisions for you and move things into different places as needed. It is definitely a game-changer service that more folks should be using. And it just happens, right? It just happens in the background, which is fantastic.
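A rough sketch of configuring those archive tiers on a bucket, assuming boto3 and a hypothetical bucket name and day count:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket; objects untouched for 180 days move to the
# Archive Access tier automatically, in the background.
s3.put_bucket_intelligent_tiering_configuration(
    Bucket="example-data-bucket",
    Id="archive-after-180-days",
    IntelligentTieringConfiguration={
        "Id": "archive-after-180-days",
        "Status": "Enabled",
        "Tierings": [
            {"Days": 180, "AccessTier": "ARCHIVE_ACCESS"},
        ],
    },
)
```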

All right. Well, hopefully, those tips are helpful to you. As a reminder, you can always go to lastweekinaws.com/QA if you have questions, or maybe there's a service where you have a lot of spend that you're just not sure how to improve. We'd love to read those questions and take a shot at answering them.

But if you did enjoy this podcast, please go to lastweekinaws.com/review. Give it a five-star rating on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review and give it a five-star rating but then also tell Corey that you want to see him back on Fridays From the Field. Maybe we could have him as a special guest, Jesse. What do you think?

Jesse: Oh, that would be fun.

Pete: You know, he can come and visit his former podcast.

Jesse: We can show him all that we've built from his empire.

Pete: [laugh]. “Look. Look at what we've built for you, Corey.”


Jesse: “Look, we've built a data bagel, just for you.”

Pete: Enjoy your bagel. Thanks, everyone. Buh-bye.

Announcer: This has been a HumblePod production. Stay humble.

Fri, 19 Feb 2021 03:00:00 -0800
The Future of AWS Marketing is a Good Story

Want to give your ears a break and read this as an article? You’re looking for this link.



Never miss an episode



Help the show



What's Corey up to?

Wed, 17 Feb 2021 03:00:00 -0800
I Hope I'm Failing the "AWS CFO Sniff Test"
AWS Morning Brief for the week of February 15, 2021 with Corey Quinn.
Mon, 15 Feb 2021 03:00:00 -0800
Listener Questions 1

Links:


Transcript

Corey: This episode is sponsored in part by LaunchDarkly. Take a look at what it takes to get your code into production. I’m going to just guess that it’s awful because it’s always awful. No one loves their deployment process. What if launching new features didn’t require you to do a full-on code and possibly infrastructure deploy? What if you could test on a small subset of users and then roll it back immediately if results aren’t what you expect? LaunchDarkly does exactly this. To learn more, visit launchdarkly.com and tell them Corey sent you, and watch for the wince.


Pete: Hello, and welcome to the AWS Morning Brief: Fridays From the Field. I'm Pete Cheslock.


Jesse: I'm Jesse DeRose.


Pete: We're back again. Hashtag Triple F.


Jesse: It's going to be a thing.


Pete: We're still trying to make it a thing. Desperately trying to make it a thing. Otherwise, we're just going to look like fools, Jesse, if it's not a thing.


Jesse: Oh now, I wouldn't want to look like a fool, you know, next to anybody else in my company.


Pete: [laugh]. It definitely seems that the one trait you need to have to work at Duckbill is being okay with looking like a fool. So, we are midway through the Unconventional Guide to AWS cost optimizations—cost savings—and we have been sharing a link on pretty much, if not all, of these recordings where you can send us feedback and send us questions. And someone finally sent us a question. I think people are listening out there, Jesse. Isn't that great?


Jesse: We have one follower. Yay.


Pete: It's amazing. So, we are really happy that someone asked us a question. You can be the next person to ask us a question by going into lastweekinaws.com/QA. That's not our quality assurance site for testing, new branding things, and new products. QA is for ‘question and answer.’


So, go there, check it out, drop in a message, you can put your name there or not, it's totally fine. But this first question—well, first, I need to actually, I need to admit something. I'm lying right now. This question actually came in months ago. We saw it and thought that was a great question, we should answer it at some point. And then we forgot about it. So we're bringing it back up again, and I think it's relevant so I don't feel too bad about it.


Jesse: Yeah, we saw this question around the time that we started recording the entire Unconventional Guide series. And apologies to this listener. This is a very good question. We want to talk about it, so we are talking about it today. But it took a little bit of time for us to get to this.


Pete: But you know what? We made it. We're here.


Jesse: We’re here.


Pete: We're here. So, Nick Moore asked this great question. He said, “Hey, Pete and Jesse. Very much enjoying your Friday segment on the Morning Brief.” Thank you very much for that. “If possible, I'd like to hear you talk about your experiences with cost optimization for, quote, ‘big data’ projects in the cloud, i.e. using platforms like Hadoop to process large and complex data, either using PaaS—like, EMR or [IS 00:03:03]. Is this something that your customers ask about often/at all? And how do or would you approach it? Thanks, again.”


Well, hey, this is a truly awesome question. And at a high level, many of our clients actually are pretty heavy users of various Amazon services for their big data needs. And big data is all relative, right? To some companies, big data is in the hundreds of terabytes; to other companies, it's in the hundreds of petabytes. But at the end of the day, no matter how big of a company you are, your big data challenges are always a challenge.


Jesse: You've got some kind of data science or data analytics work that you want to do with large data sets. They may be large compared to the work you've done before; they may be large compared to the industry. Doesn't matter. Either way, these are big data projects, and there are many, many, many, many solutions out there.


Pete: What's interesting, too, is I think the reason this has grown in prevalence over the last year, with more of our clients using more of these services, is simply because the barrier to entry on these projects, on these engagements, is so low. You can get started on Amazon with some Athena and Glue, maybe some EMR, for an incredibly low cost. And from a technical standpoint, it's not that challenging. As a good example, most reasonably technical people could take their cost and usage report and get it integrated into Athena using AWS Glue in minutes, without using CloudFormation, just by clicking through to set it up. And honestly, for some clients, the cost and usage report itself is a big data problem. If you're not storing it in Parquet, if you're actually storing it in CSV because you're a mad person, it could be hundreds of gigabytes a day in volume.


Jesse: Yeah. So, when we talk about big data tasks, there's a couple different services that we generally see folks using within AWS. We generally see S3, Kinesis, and most obviously, EMR.


Pete: Yeah, exactly. And we're seeing newer services expanding on Kinesis, like Kinesis Firehose when that came out; people are using that for some of their big data needs, especially for streaming data into S3. That's a really powerful thing Firehose can do. And then, once the data is in S3, the question our clients often ask is, kind of, “What do I do with it now?” So, if we dive into just S3, and you've got your data in S3, where are the kinds of places that we see unnecessary charges for data warehouse tasks?


Jesse: Honestly, it's unfortunately both of the major places that S3 charges you: your storage costs and your requests.


Pete: So, what you're saying is that all S3 charges are unnecessary. [laugh].


Jesse: Just get rid of it. Just put all that on an EBS volume somewhere. Turn off your S3; you're solid.


Pete: Exactly. It is kind of funny, but it's true. I mean, there are ways to abuse both of those pricing models, whether it's storage or requests. The first place that we honestly see a lot of this is that people are data pack rats. And let's be honest; I'm one of them as well: I have a NAS setup at home with, like, 30 terabytes of hard drives in it.


I don't throw anything away digitally. It turns out most of our clients are the exact same way, and sadly, a lot of them use standard storage for S3, which we talk about often. It's common: you get started with standard storage, and that's a fine place to start. But for big data tasks, it's often the wrong storage class, especially for data that has already been transformed and stored in a more efficient format, or that's queried infrequently. There are two ways to solve this one.


Obviously, intelligent tiering can be a big help to automatically move your data to the right tier. But another thing you can do, if you're already running some EMR clusters, is set up a Spark task to automatically tier data to lower-cost locations really easily; then you avoid the intelligent tiering monitoring costs by using infrastructure you already have. The key thing I always like to point out is: when you're done with the data, move it to cheaper storage or delete it. Glacier Deep Archive and deleting it are almost the same price; that's how cheap Glacier Deep Archive is. If you're not sure whether you're going to need it, or maybe compliance says you'll need it, just deep archive it and move on with your day. But whatever you do, don't just leave it on standard.
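As a rough sketch of that "when you're done, archive it" step, an in-place copy can rewrite an object into Deep Archive; the bucket and key here are hypothetical:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical object; copying it onto itself with a new storage class
# rewrites it into Glacier Deep Archive. Works for objects up to 5 GB;
# larger objects need a multipart copy instead.
key = "processed/part-0000.parquet"
s3.copy_object(
    Bucket="example-data-bucket",
    Key=key,
    CopySource={"Bucket": "example-data-bucket", "Key": key},
    StorageClass="DEEP_ARCHIVE",
    MetadataDirective="COPY",
)
```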


Jesse: Yeah, so then if you think about this from the request perspective, there's a lot of get and put requests when working with data in S3. Obviously, you are putting data into S3, you're pulling data out of S3, you're moving data around. We see this a lot, especially when folks use the Parquet file format. Now, again, we do recommend the Parquet file format, but there are ways to optimize how large your Parquet files are. So, for example, imagine your Parquet files are around 100 megabytes in size.


To complete a query, you need to make about 10 get requests to access 10 gigabytes of data. But if you right-size your Parquet files to about 500 megabytes to 1000 megabytes in size, you can cut those requests by 50 to 90 percent. And in many cases, we've seen clients implement this without any impact to their production workloads. So, keep in mind that it's not just about moving the data—getting and putting the data—it's about how often are you getting and putting the data? How large are those requests sizes?


Pete: Yeah, exactly. Because I know there's someone out there that's probably doing the math on what you just said, I think you meant 10 gets to access one gigabyte of data versus 10 gigabytes of data. Someone out there is doing the math on that. They're about to go to lastweekinaws.com/QA to tell us about it. But hopefully, I have preempted that.


Jesse: Yes, thank you, thank you.


Pete: But of course, it's an important point: we've actually seen scenarios where the request costs exceed the storage costs for a bucket. That is what we would consider an outlier in spend, and it's not needed in a lot of cases. So, think about that. Think about your file sizes; likely, you can increase them.
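For the folks doing the math at home, the request arithmetic works out roughly like this (illustrative numbers only):

```python
# Scanning ~1 GB of Parquet data, per Pete's correction above.
scan_bytes = 1_000_000_000

for file_mb in (100, 500, 1000):
    requests = scan_bytes // (file_mb * 1_000_000)
    print(f"{file_mb} MB files -> ~{requests} GET requests per scan")

# 100 MB files  -> ~10 GET requests per scan
# 500 MB files  -> ~2 GET requests per scan  (80% fewer)
# 1000 MB files -> ~1 GET request per scan   (90% fewer)
```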


And this is a good test as well. This is something you can try out: try different file sizes, try different queries, see how it impacts performance. You could get some pretty dramatic cost savings by adjusting it, like you said, Jesse. So, let's talk about Kinesis, which is great. It's one of the best services; I really love it—well, I think most people thought it was the best service prior to that—


Jesse: Yeah…


Pete: —outage last year.


Jesse: Yeah…


Pete: A lot of folks that we spoke with, a lot of our clients, were all in on Kinesis. And honestly, the outage has a few of them pumping the brakes a little bit, or rethinking their usage. So, that sentiment has changed a little bit. But we have seen a couple of places where Kinesis savings can be had by just looking a little closer at how you're using it. So, what's one of the first things we've seen, Jesse, around Kinesis cost savings?


Jesse: One of the biggest things we've seen is Kinesis data duplication. You have data that needs to go to different places to be tweaked, analyzed, moved about this way and that, but ultimately, it's all coming from the same data source. If you have multiple Kinesis streams that all contain the same data, you're paying for each of those streams individually. You don't need to. Instead just have a single Kinesis stream with the enhanced fan-out option, which essentially allows you to have that single source of data, but then there could be multiple consumers that are receiving that data for their analysis purposes.
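Registering one of those fan-out consumers is a single call; here's a minimal boto3 sketch with a hypothetical stream ARN and consumer name:

```python
import boto3

kinesis = boto3.client("kinesis")

# Hypothetical ARN; each registered consumer gets its own dedicated
# 2 MB/s read pipe from the same stream, instead of a duplicated stream.
consumer = kinesis.register_stream_consumer(
    StreamARN="arn:aws:kinesis:us-east-1:123456789012:stream/example-stream",
    ConsumerName="analytics-consumer",
)
print(consumer["Consumer"]["ConsumerARN"])
```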


Corey: This episode is sponsored in part by CircleCI. CircleCI is the leading platform for software innovation at scale. With intelligent automation and delivery tools, more than 25,000 engineering organizations worldwide—including most of the ones that you’ve heard of—are using CircleCI to radically reduce the time from idea to execution to—if you were Google—deprecating the entire product. Check out CircleCI and stop trying to build these things yourself from scratch, when people are solving this problem better than you are internally. I promise. To learn more, visit circleci.com.


Jesse: And similar to S3, we've also seen really high storage costs in Kinesis streams. Think about reducing the retention for your non-critical streams. Think about whether you ultimately need all of the data that you're sending through each of these streams. In some environments, you may need all of the data in every stream possible; in others, you might not. Some environments might use just a smaller subset of the data, so you don't need to move all of that data from place to place.


And don't forget about compression. Compression is something that we've seen many, many clients either not enable, forget to enable, or maybe they just don't have the best practices in place; maybe nobody has stood up and said, “Hey, this is how we want to move our data, and it's important for us to optimize this spend.” Start doing that today. Be the first person on your team or in your organization to say, “Let's put data compression on our Kinesis streams.” It will help you save money. Also consider binary formats like Avro, Thrift, or Protocol Buffers. If you just end up shoving uncompressed JSON into Kinesis, you're going to have a really, really bad time.
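A minimal sketch of compressing a record before it goes onto a stream, assuming boto3 and a hypothetical stream name and payload; the same idea applies if you swap gzip for Avro, Thrift, or Protocol Buffers:

```python
import gzip
import json

import boto3

kinesis = boto3.client("kinesis")

record = {"event": "page_view", "user_id": 42}  # hypothetical payload

# Compressing before the put shrinks the per-record payload,
# the stream storage, and the downstream data transfer.
payload = gzip.compress(json.dumps(record).encode("utf-8"))

kinesis.put_record(
    StreamName="example-stream",
    Data=payload,
    PartitionKey=str(record["user_id"]),
)
```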


Pete: Yeah, and we've seen that. It's amazing, when you move from uncompressed JSON to a binary format, how much that reduces. What's important, too, and I always like to call this out, are the downstream effects of compression, of moving away from uncompressed JSON: less data transferred over the wire means downstream savings in network data transfer and I/Os and all those other good things.


So, for some workloads, though, we often actually recommend Kafka. Now, it depends on how you're using Kinesis; there are some Kinesis features that don't directly translate to Kafka. But for a lot of folks, Kafka, even the managed service for Kafka, that MSK service, is highly recommended. Because the unit of scaling in Kinesis is the shard: for every megabyte per second, or thousand records per second, you'll have to add another shard.


And the downside of this is bursty workloads. We ran into one client who had to scale Kinesis out to support a pretty bursty workload; they needed to ensure they were always accepting incoming data. Within a few-second window they were bursting up to thousands, tens of thousands, hundreds of thousands of records a second, but sitting idle the rest of the time; you obviously have to scale out to support that. So, in that scenario, Kafka is actually the better option because it can handle a lot more data without requiring as much shard scaling and the cost associated with it. In the end, it's just cheaper.


And the other interesting thing is that MSK does not charge for the replication traffic that you'd pay for if you ran Kafka yourself. So, before you go out there and say, “I’m going to run my own Kafka,” definitely plan for network data transfer costs as well, because I will tell you from personal experience: they will be larger than you expect. So, EMR. Let's talk about the elephant in the room. That's a Hadoop joke out there.


Jesse: [laugh].


Pete: So let's talk about EMR. What's the first thing that most people are not doing with EMR, Jesse?


Jesse: They are not using spot instances. And now I know what you're going to say: “But wait. Spot instances, aren't those the things that will ultimately just die whenever AWS needs the resources back and my workloads may be interrupted at any time?”


Pete: Yeah, that sounds terrible. I don't like that.


Jesse: Yeah. That's thankfully not quite the case anymore. EMR has integrated with spot in amazing ways. Specifically, there are newer features called instance fleets and spot blocks that can help you guarantee your spot instances for one to six hours. It's a slightly higher price, but it's still less than you'd pay for on-demand EC2 instances. It is absolutely worth looking into.
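A hedged sketch of what an instance fleet with a spot block might look like via boto3; the cluster name, release label, instance types, capacities, and roles are all hypothetical:

```python
import boto3

emr = boto3.client("emr")

# Hypothetical cluster: core nodes on spot with a three-hour block,
# falling back to on-demand if spot capacity isn't granted in time.
emr.run_job_flow(
    Name="example-batch-cluster",
    ReleaseLabel="emr-5.32.0",
    ServiceRole="EMR_DefaultRole",
    JobFlowRole="EMR_EC2_DefaultRole",
    Instances={
        "InstanceFleets": [
            {
                "InstanceFleetType": "MASTER",
                "TargetOnDemandCapacity": 1,
                "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}],
            },
            {
                "InstanceFleetType": "CORE",
                "TargetSpotCapacity": 4,
                "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}],
                "LaunchSpecifications": {
                    "SpotSpecification": {
                        "TimeoutDurationMinutes": 20,
                        "TimeoutAction": "SWITCH_TO_ON_DEMAND",
                        "BlockDurationMinutes": 180,  # three-hour spot block
                    }
                },
            },
        ],
        "KeepJobFlowAliveWhenNoSteps": False,
    },
)
```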


Pete: Yeah, this one's a great one. Instance fleets are also great because you can essentially augment on-demand with spot, and if the spot instances go away—say you didn't use a spot block—they just get backfilled with on-demand again. That's pretty powerful stuff. When using a spot block, though, it means that if you have a series of jobs that you know finish within three hours, you can go and set up a three-hour spot block, and you will have those instances available for those three hours.


You don't have to pay for all three hours; you're still charged the normal per-second billing, but the instances will not be pulled out from underneath you. And again, the less time you commit to, the better your discount; if you're okay with immediate interruption, that's the cheapest way to run spot. But these spot blocks are a great way to run predefined tasks. And use spot blocks for your master, your core, and your task nodes as well.


Because again, if you're using those instance fleets, EMR will provision with on-demand when a spot goes away. So that's a really big one that a lot of people are missing out on. But it's not the only one. We also find that people are not adequately monitoring their jobs and their workload resource usage. This is just—it sounds crazy to say that, “Oh, my. People are over-provisioning EC2 instances.” I'm shocked.


Jesse: [laugh].


Pete: Just—right? Shocked.


Jesse: “I mean, I just set it and forget it, right?”


Pete: [laugh]. But it happens even more so on EMR. We’ve found clusters that might be running for 30 minutes when the job is only running for five minutes. I mean, there are CloudWatch metrics that you can grab to identify these idle clusters. And this is free money. I mean, this is click button, save money, which—


Jesse: Absolutely.


Pete: —we love to see. But what are a couple other practices that we've often recommended to our clients, Jesse?


Jesse: Some other highlights: aim for about 80% average CPU utilization on your instances for your jobs; that's a big one. That's the sweet spot we've seen where our clients get the best bang for their buck utilizing these cluster resources. Also, try to aim for runtimes between one to three hours. Again, this is where we've really seen that efficiency sweet spot. If you can run your jobs in less than that time, fantastic.


Because again, as Pete said, you will only be paying for spot instances as long as the spot instance is active and running your workload. But if you can schedule your runtime for roughly one to three hours, that seems to be the ultimate spot instance sweet spot that we've seen.


Pete: Yeah, exactly. The other thing to do is audit the jobs that you've created and make sure your engineering folks are not messing around with the Spark executor memory setting in either your Python or Scala code. Honestly, most of the time that setting never needs to be changed if you're using EMR. Instead, change the instance type you're running on. And this is where a little bit of research on instance types can yield huge savings.


Most folks are not using the right EC2 instance types when they use EMR. They just kind of pick one at random and then roll with it. But another thing to do is check your job memory profiles, and you want to adjust the instance type to match. So, an example we like to give our clients is, let's assume there's a [fake 00:20:04] instance type, like an m1-fake, with ten cores and 64 gigabytes of RAM. And let's assume that we want to keep about four gigabytes of overhead for the OS.


That's just a rough number; it could be anything depending on the OS you're using, but it leaves us about 60 gigs of RAM available. So, if your executor is set for 6 gigs of RAM, each node will have ten of ten cores used: one job per core, each using 6 gigabytes of RAM. That is a fully loaded instance. That's what you want.


That's the ideal scenario. But if someone were to change that setting from 6 gigs to 10 gigs—and you don't want to mess with that setting—now you can only run six of those jobs on that host, which means only six of your ten cores are in use. That's 60 percent utilization and 40 percent waste: 40 percent of that host is just sitting there unused. So that's why we often say not to mess around with those memory settings; you probably want to change your instance type first. So, that's the 20-minute fast guide to saving on EMR. Right, Jesse?
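Pete's back-of-the-envelope math, as a runnable sketch using the hypothetical m1-fake numbers:

```python
# Executor packing on the hypothetical m1-fake: 10 cores, 64 GB RAM.
cores = 10
ram_gb = 64
os_overhead_gb = 4
usable_gb = ram_gb - os_overhead_gb  # 60 GB left for executors

for executor_gb in (6, 10):
    executors = min(cores, usable_gb // executor_gb)
    print(f"{executor_gb} GB/executor -> {executors} executors, "
          f"{executors / cores:.0%} of cores used")

# 6 GB/executor  -> 10 executors, 100% of cores used
# 10 GB/executor -> 6 executors, 60% of cores used
```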


Jesse: Yeah. And if you have other questions about this, please reach out to us. You can hit us up at lastweekinaws.com/QA, but also feel free to tag us on the social medias, especially on Twitter. We are happy to continue to talk about this more.


Pete: Yeah, absolutely. If you enjoyed this podcast, please go to lastweekinaws.com/review, give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review. Give it a five-star rating on your podcast platform of choice and tell us what you love most about Kinesis other than the fact that it went down horrifically that one time.


Jesse: [laugh].


Pete: Thanks, everyone.


Announcer: This has been a HumblePod production. Stay humble.

Fri, 12 Feb 2021 03:00:00 -0800
What the Hell is Amazon Web Services

Want to give your ears a break and read this as an article? You’re looking for this link.


Never miss an episode

Help the show

What's Corey up to?

Wed, 10 Feb 2021 03:00:00 -0800
Andy Jassy Ascends to Sea Level
AWS Morning Brief for the week of February 8, 2021 with Corey Quinn.
Mon, 08 Feb 2021 03:00:00 -0800
Moving Data Is Expensive and Painful (Just Like Moving Banks)


Transcript

Corey: This episode is sponsored in part by our friends at Fairwinds. Whether you’re new to Kubernetes or have some experience under your belt—and then definitely don’t want to deal with Kubernetes—there are some things you should simply never, ever do in Kubernetes. I would say, “run it at all.” They would argue with me, and that’s okay because we’re going to argue about that. Kendall Miller, president of Fairwinds, was one of the first hires at the company and has spent the last six years making the dream of disrupting infrastructure a reality while keeping his finger on the pulse of changing demands in the market and valuable partnership opportunities. He joins senior site reliability engineer Stevie Caldwell, who supports a growing platform of microservices running on Kubernetes in AWS. I’m joining them as we all discuss what Dev and Ops teams should not do in Kubernetes if they want to get the most out of the leading container orchestrator by volume and complexity. We’re going to speak anecdotally of some Kubernetes failures and how to avoid them, and they’re going to verbally punch me in the face. Sign up now at fairwinds.com/never. That’s fairwinds.com/never.


Pete: Hello, and welcome to the AWS Morning Brief: Fridays From the Field. I am Pete Cheslock.


Jesse: I'm still Jesse DeRose.


Pete: We're still here. And you can also be here by sending us your questions at lastweekinaws.com/QA. We're continuing our Unconventional Guide to AWS Cost Management series, and today we're talking about moving data. It's not cheap, is it?


Jesse: No, it's definitely not cheap. It is expensive, and it's painful. And we're going to talk about why, today. And a reminder, if you haven't listened to some of the other episodes in this series, please go back and do so. Lots of really great information before this one and lots of really great information coming after this one. I'm really excited to dive in.


Pete: Yeah, look, they're all great episodes at the end of the day, right? They're just all fantastic.


Jesse: Yeah.


Pete: If I do say so myself.


Jesse: All of the information is important; all of the information is individually important—I think that's probably the best way to put it. You can listen to all these episodes and implement maybe just a handful of things that work best for you; you can listen to all these episodes and implement all of them, all of the suggestions. There's lots of opportunities here.


Pete: If you do actually go and implement all of these suggestions, you really should go to lastweekinaws.com/QA and tell us about it. We'd be very curious to hear how it goes. But if you're struggling with any of these, just let us know as well. These are things that are measured in long periods of time.


It is rare that we run into engagements with clients where you can just click a box and save money. Now, don't get me wrong; there's a whole bunch of those, too. But if you want to fundamentally improve how you're using the Cloud and how you're saving money, those projects are multi-year investments. All of this stuff takes a long time, and you've got to manage those expectations appropriately.


And specifically around this topic, moving data, it is—as Jesse said—painful. It is expensive, especially in Amazon. They will charge you to move the tiniest bit of data literally everywhere, with, like, two minor exceptions. And it's just the worst. At Duckbill Group, we've kind of become experts on data transfer and data storage costs, on understanding the complexity around them. And I feel like a lot of the time, folks only think about the storage being the biggest driver of their spend.


Jesse: Absolutely.


Pete: You know, you never delete your data. But you put it all on S3, right, Jesse? Like that's a cheap place to put your data.


Jesse: Absolutely. Worthwhile. Put it in S3 standard storage, call it a day. I'm done, right?


Pete: Yeah, just do my little, like, wipe-my-hands and move on, and we're good. Most people put it in standard storage, just like most people use gp2 EBS volumes; that's the default everything. And that could be a big driver of cost, but more likely the larger driver—because it's a little bit more hidden, a little bit more spread around your entire bill—is the transferring of data, the moving of data around. And I say moving specifically because there are some services that charge via I/Os: via actually putting data in or taking data out, not just the data transfer.


Jesse: I think it's also really important to call out that most companies that move into the Cloud don't realize that data transfer is something that AWS will charge you for, so I want to make that explicitly clear. As Pete mentioned, in almost every case moving data around, AWS will charge you for that versus in a data center environment where that's kind of hidden, that's not really explicitly a line item in your bill. And here, it absolutely is a line item in your bill and absolutely should be thought of as an important component to optimize.


Pete: Exactly. In the data center world, for any of the folks out there in data-center land, or maybe hybrid-cloud land, your networking costs are largely a sunk cost. You've got your switches and your lines that run; maybe you get charged for cross-connects and for data transfer to other areas, things like that. But within your racks, within your own secure domains, you don't really have to think about the cost of those network communications because it's already paid for. And you're definitely not charged at a per-gigabyte level like you are on Amazon.


Jesse: So, we talked about this a little bit before in a previous episode, when we talked about context is king. Context for your application infrastructure is really, really important; understanding how your application interacts with other applications within your cloud infrastructure ecosystem; how your data moves between workloads. All of these things are really, really important, and so specifically, when we talk about data transfer, it's really important to not just understand how your data is moved around, but why your data is moved around. So, we really like to suggest working with all of the teams within your organization. Again, product, potentially legal, maybe IT, to understand your data movement patterns and the business requirements for those data movement patterns.


Why does your data need to move multiple times within an availability zone? Why does it need to move between regions? Do you need to have data that is copied across multiple availability zones? Do you need that data to be cross-region? These are some examples of really important questions to ask to understand, do you need to continue transferring that data? Because the more you can optimize the way that that data is moving around within AWS, the less money you'll ultimately spend.


Pete: Yeah, and as you've noticed, there's a recurring theme: all of these episodes in this series tie into one another. They build on top of each other in many ways; you can do these things independently, but they compound and bring you bigger benefits. And so in a previous episode, we talked about your network architecture diagrams and how you could overlay costs, but you should overlay the data flows on top of them as well. Again, those data flows will have an inherent cost to them. And Jesse, I love that you pointed out talking to legal, because there are potentially risk and compliance requirements as they relate to your data and data transfer. Think about—


Jesse: Absolutely.


Pete: —GDPR and keeping data inside of certain regions, or, from a risk and compliance side, keeping your data in many regions, replicating it to other regions. It's not that you shouldn't ever replicate your data; the biggest thing about a lot of this stuff we're talking about is providing knowledge on the cost of a decision. So, think about a business that has made the decision to replicate all its data into five availability zones. Okay, well, that will have a cost to it. If no one knows what that cost is, they can't make an informed decision.


So, when your boss is coming over to your desk and screaming at you—well, not to your desk, because COVID—but popping into your Zoom with, “Why is this bill so high?” and, “What is going on?” If you have that knowledge, you can say, “Well, here are the places where we spend our money, and these are the decisions, from a product and risk perspective, that have driven these costs.” Any one of these decisions can be changed, right? Nothing is set in stone. They're all just different things that businesses have to think about.


Jesse: I think it's also really important to call out that, as you and your team or the individuals you work with from other teams start to have these conversations, you should also start thinking about the best practices you want to see within your organization for data transfer, whether that is for a specific type of solution, like distributed stateful systems such as Cassandra or Kafka, and really start to build communities of practice within your organization to decide how best to implement them. For example, we worked with a client previously who did a lot of data compression on disk for their Cassandra cluster, but the data was essentially not compressed when it moved between components of the cluster, or between regions when it was replicating. So there was all this data transfer just flying around uncompressed, costing a lot of money. The team managing the cluster really wanted to get some best practices in place, and the teams sending data to and reading data from the cluster wanted best practices in place, but nobody really understood whose responsibility it was to put them there.


And this is a fantastic opportunity to build that community of practice together to make sure that everybody knows what those best practices are and build those best practices together to ultimately bring those costs down.


Corey: This episode is sponsored in part by CircleCI. CircleCI is the leading platform for software innovation at scale. With intelligent automation and delivery tools, more than 25,000 engineering organizations worldwide—including most of the ones that you’ve heard of—are using CircleCI to radically reduce the time from idea to execution to—if you were Google—deprecating the entire product. Check out CircleCI and stop trying to build these things yourself from scratch, when people are solving this problem better than you are internally. I promise. To learn more, visit circleci.com.


Pete: Yeah, I remember at a previous company, we were handling large amounts of data, consuming it from a variety of sources, and I'm pretty sure we had compression enabled for some of our data pipeline activities. I still remember the day we migrated over to Snappy compression from whatever we were using prior. Those data transfer graphs dropped by so much that we legitimately thought we broke something. Luckily, we didn't break anything, and watching those graphs go down so dramatically also meant my data transfer bill went down dramatically.


And that's a really good point. People miss out on compression a lot. In some open source applications, it wasn't even an option. Like, you couldn't do it. Looking at you, Elasticsearch.


Now, I believe in some newer versions there's some compression there. Cassandra, I know, had some issues with that for a while. We see it a lot with things like Kafka as well: people are not configuring their consumers with it. So, that's definitely something to look at. I always like to talk about some of the actual things you could look at in your organization to improve your data transfer spend.


A couple of other places I like to call out as well, especially when understanding data flows, involve one of my—and I'm using air quotes right now—“favorite” services: the NAT gateway. The wonderful NAT gateway service. Kudos to the Amazon product owner there, who must sleep on a bed full of $100 bills and on their hundreds-of-millions-of-dollars yacht, because they are making an obscene amount of money on a service that, in my mind, in my personal opinion, does very little. And we do look at a lot of Amazon bills, and there are a lot of folks spending millions of dollars a year on NAT gateways. So, you really have to ask yourself what that service is providing. There are a lot of folks online who will say, “Well, you have to run your instances in a private VPC for security reasons.” And sure, there are probably some reasons for that, but if you have instances that are constantly communicating with the public internet, those may actually be better off in a public subnet. You have security groups, right? Firewalls. These exist.


Jesse: I think it's also important to call out here that you can look at the actual data traversing your managed NAT gateway to decipher how much of it is internal AWS traffic to services like S3, DynamoDB, and others that you could move to a VPC endpoint instead. Now granted, some of those endpoints are free, and some charge you for the time they run and the amount of data traversing them, but it is absolutely worth running the numbers to see if the data you're sending to some of these internal AWS services can move to a VPC endpoint. We've seen clients save lots of money by keeping all that traffic internal, rather than sending it out through a managed NAT gateway to the public internet and back into AWS through whatever AWS service endpoint.


Pete: Yeah. And the two endpoints that are actually free—you can go enable these right now—are for S3 and Dynamo. So, imagine this scenario. You have a server in a private subnet behind a NAT gateway. That VPC the server is in is truly a secure network.


And when you're communicating with these other Amazon services, you are leaving. You're going to the public internet to talk to S3, which means you're traversing that NAT gateway. So, if you have a service that is, maybe, pushing a lot of big binary content to S3, you're going to get charged not only for normal data transfer costs if it crosses AZ boundaries, but you're also going to get the four-and-a-half-cents-per-gigabyte NAT gateway fee added on top of that. And you can literally avoid that entire fee—this is one of those click-box things: you click a box and you enable an S3 endpoint to give that service secure communication to S3.
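For reference, creating that free S3 gateway endpoint is a single API call. Here's a minimal boto3 sketch, with hypothetical VPC and route table IDs; the DynamoDB endpoint works the same way with a different ServiceName:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Gateway endpoints for S3 and DynamoDB are free; this adds a route
# so S3 traffic bypasses the NAT gateway entirely.
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-east-1.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],
)
```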


It's almost like they're trying to force people to not use NAT gateways. Maybe the NAT gateway architecture inside Amazon is just so terrible that they're actively charging an obscene amount of money to get people to not use it. But it's clearly not stopping anyone. You can tell that I'm very salty about NAT gateways. And just from my own personal experience, I remember I used to run my own NAT instances. That's what you did before NAT gateway was a thing.


You spun up an instance. All of mine were t-class whatevers because, again, most of my instances weren't doing heavy communication out to the internet; I didn't need a ton of bandwidth. So I would spin up these t2s and run my own NAT instances inside an auto-scaling group. If they died, they came back; it was the easiest thing ever. And then one day NAT gateways came out and I thought, “Oh, well, it's a reason to just run less EC2. That's totally fine.” So, I flipped all those services over. And then in a few months I'm like, “Why am I spending $10,000 a month for these NAT gateways?” I was spending, like, $500 a month for my t-class instances. And so I moved everything back. It's annoying. It's my most hated service at Amazon.


Jesse: Thanks, I hate it.


Pete: [laugh]. So don't give them your money for it. You can solve this problem: set up those endpoints like we're talking about. We've seen so many clients save just tons of money by setting up those endpoints in their VPC to talk to other Amazon services. Or turn on flow logs for a couple of hours. Don't turn them on for a long time—


Jesse: Please.


Pete: And surely do not send them to CloudWatch. Send them to S3. You can query this stuff in Athena; there are tons of posts on the Amazon blog and in the documentation that will teach you how to query those. And you can start looking to see: where are my things connecting to? Spoiler alert, if you're a Datadog customer, they're talking to Datadog.
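If you want to follow that advice, here's a minimal boto3 sketch of enabling flow logs with an S3 destination; the VPC ID and bucket ARN are hypothetical, and you'd delete the flow log again after a few hours of collection:

```python
import boto3

ec2 = boto3.client("ec2")

# Capture all traffic for the VPC and deliver it to S3 (not CloudWatch),
# where it can be queried cheaply with Athena.
ec2.create_flow_logs(
    ResourceIds=["vpc-0123456789abcdef0"],
    ResourceType="VPC",
    TrafficType="ALL",
    LogDestinationType="s3",
    LogDestination="arn:aws:s3:::my-flow-log-bucket/vpc-flow-logs/",
)
```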


Jesse: And I think there's a really quick, fun thing to point out here, too, which is that some third-party software solutions that are also on AWS have their own PrivateLink endpoints that you can configure and connect to. So instead of sending your data out through the public internet and back in through their connectors, you can send the data directly to them through additional VPC endpoints.


Pete: Yeah, those PrivateLink endpoints are just a great service. They will be a lot cheaper for sending your data to those third-party vendors, and they're honestly more secure. One thing to keep in mind, though—I love giving these actionable tips if at all possible—is that if, let's say, your vendor can only take connections in from us-east-1 and you're in another region, you will have to pay to cross an AZ boundary. But in almost all scenarios—there are always some exceptions—it's still going to be cheaper to cross an AZ boundary and send to the private link, because, again, NAT gateways are just so prohibitively expensive.
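An interface (PrivateLink) endpoint to a vendor is created much like the gateway endpoint above, just with subnets and security groups attached. A minimal sketch; the service name your vendor gives you replaces the placeholder here, and the IDs are hypothetical:

```python
import boto3

ec2 = boto3.client("ec2")

# Interface endpoints place an ENI in your subnet; traffic to the vendor
# stays on the AWS network instead of transiting the NAT gateway.
ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.vpce.us-east-1.vpce-svc-0123456789abcdef0",
    SubnetIds=["subnet-0123456789abcdef0"],
    SecurityGroupIds=["sg-0123456789abcdef0"],
)
```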


Jesse: Now that we've written a love letter to VPC endpoints, I feel like I want to spend a few minutes talking about the hidden costs of I/O before we wrap this up and send you out into the world to look at all of your data transfer bit by bit.


Pete: And become horrified.


Jesse: Yeah.


Pete: So, we talk about moving the data, but when you move the data, you usually have to make an I/O; you have to make some sort of communication. And there are more and more services that are charging based on those I/Os—Aurora being probably the largest, or maybe the most popular, Jesse, if you think of it that way.


Jesse: Yeah.


Pete: These I/O costs can be hard to predict, but we've definitely seen scenarios where folks are ingesting data into Aurora from S3, and all of those I/Os, all of those writes, they're going to get charged for. The I/Os, plus the storage, plus the engine. So, you have these three vectors you're being charged on.


And so you really need to start thinking about these usage patterns. How are writes happening? How are reads happening? What is the size of those I/Os? You'll have to dive into the documentation to figure out how best to optimize, because of how, again, they charge for these I/Os. But if you are constantly reloading data into an Aurora database, you're getting additionally charged for all of this data movement. The movement is causing these I/Os to occur.
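As a rough illustration of why this adds up, here's a back-of-the-envelope calculation, assuming the commonly cited Aurora rate of $0.20 per million I/O requests (check current pricing for your region):

```python
# Estimate monthly Aurora I/O cost from a sustained I/O rate.
RATE_PER_MILLION_IOS = 0.20  # USD; us-east-1 list price at the time

def monthly_io_cost(avg_ios_per_second: float) -> float:
    ios_per_month = avg_ios_per_second * 60 * 60 * 24 * 30
    return ios_per_month / 1_000_000 * RATE_PER_MILLION_IOS

# A pipeline sustaining 5,000 write I/Os per second around the clock:
print(f"${monthly_io_cost(5000):,.2f}")  # roughly $2,592/month, before storage and engine
```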


Jesse: Yeah, and this really makes the case for a data warehouse—a term I hate—or data lake, whatever the hot new phrase is that all the kids are using. But it makes sense. You can keep your data in one place and set up access for different teams to be able to access different parts of that data. Data transfer in and out of S3 is free, and then you can use all of the query functionality of Athena or other tools within AWS to do all of the queries you need and all of the calculations you need with that data.


Pete: Yeah, I agree. Data warehouse, data lake. These are stupid names.


Jesse: [laugh]. I’m glad that that's the thing that you agree with me on.


Pete: [laugh]. Yeah, exactly. That's the only thing, Jesse. I actually had a joke at a conference I spoke at. I was basically preaching to this audience about how you should turn all of your monitoring into a data lake—your logs, your metrics, your traces, everything—and centralize it for query and analysis.


And I just hated the term so much that I just wanted to come up with something new that had equally no meaning, so I call it a bagel, a data bagel. And, again, it has no meaning, so what does it matter?


Jesse: So, is that like, a security around the outside, and then the massive security hole through the middle?


Pete: You know, we could spend a whole episode analyzing my nonsense. Not today, Satan. It's true, though: exploit the fact that you can transfer into and out of S3 from many or most Amazon services for free, and use that to your benefit to centralize as much data as possible in that one place. Some of the most mature organizations we work with push as much of their data as possible into S3, and then pull it into other services for query.


I mean, you could do ad hoc queries with Athena, or you can ingest it into a Redshift cluster and do some analysis. If you're a Snowflake user, obviously, they can suck your data right out of S3 into Snowflake for analysis. That data transfer is free. Crossing an AZ boundary is not free. Again, think about your data flows.


Can you have instances pushing data to S3 where other instances can pull that data out? It's an incredible service. It's shocking that there's anything for free in Amazon, so when you can find those slight benefits, exploit them for the greatest gain possible.


All right, well, I thoroughly enjoyed just hating on NAT gateways, and I could probably keep going for a long time. But we will save that for another time. If you did enjoy this podcast, please go to lastweekinaws.com/review and give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, go to lastweekinaws.com/review and still give us a five-star rating, but then also go to lastweekinaws.com/QA and give us your questions and feedback. We would love to answer them in a future episode. Thanks again.

Fri, 05 Feb 2021 03:00:00 -0800
Elastic Throws in the Towel on Open Source, Chooses SSPL

Want to give your ears a break and read this as an article? You’re looking for this.

Never miss an episode



Help the show



What's Corey up to?

Wed, 03 Feb 2021 03:00:00 -0800
Unsafely Accelerating AWS Customers
AWS Morning Brief for the week of February 1, 2021 with Corey Quinn.
Mon, 01 Feb 2021 03:00:00 -0800
The Unconventional Guide: The Cloud Is Not Your Data Center

Links


Transcript

Corey: This episode is sponsored in part by our friends at Fairwinds. Whether you’re new to Kubernetes or have some experience under your belt—and then definitely don’t want to deal with Kubernetes—there are some things you should simply never, ever do in Kubernetes. I would say, “run it at all;” they would argue with me, and that’s okay because we’re going to argue about that. Kendall Miller, president of Fairwinds, was one of the first hires at the company and has spent the last six years making the dream of disrupting infrastructure a reality while keeping his finger on the pulse of changing demands in the market and valuable partnership opportunities. He joins senior site reliability engineer Stevie Caldwell, who supports a growing platform of microservices running on Kubernetes in AWS. I’m joining them as we all discuss what Dev and Ops teams should not do in Kubernetes if they want to get the most out of the leading container orchestrator by volume and complexity. We’re going to speak anecdotally of some Kubernetes failures and how to avoid them, and they’re going to verbally punch me in the face. Sign up now at fairwinds.com/never. That’s fairwinds.com/never.


Pete: Hello, and welcome to the AWS Morning Brief: Fridays From the Field.


Jesse: I like that. I feel like that's good. That's a solid way to start us off.


Pete: Triple F. I am Pete Cheslock.


Jesse: I'm Jesse DeRose.


Pete: #TripleF. We should get some, I don’t know, jackets made? Mugs?


Jesse: Lapel pins? I'm open. I've always wanted a Members Only jacket.


Pete: If Guy Fieri can call Diners, Drive-Ins, and Dives “Triple D,” then we can definitely call this Triple F.


Jesse: We can definitely make this happen.


Pete: It's not my high school transcript we're talking about here, either. Oh, well, we are back again, continuing our series on The Unconventional Guide to Cost Management with Episode Two: The Cloud Is Not Your Data Center.


Jesse: Yeah, this one's gonna be a fun one. I feel like this is a topic that comes up a lot in conversations, sometimes with clients, sometimes with potential clients that are asking, “What kind of things do you see day-to-day? What are some of the big pain points that you see with your cost optimization work?” And so real quick backstory, make sure that you've listened to the previous few episodes to get some context for this segment that we're doing and get some framing for this Unconventional Guide work that we are discussing. But talking about using the Cloud as a data center, I have a lot of thoughts on this.


Pete: Well, hold on a second. Isn't the Cloud just someone else's data center?


Jesse: [laugh] I—yeah, you know, this is the same argument of serverless isn't actually serverless. It's just somebody else's computer.


Pete: [laugh]. Someone else's Docker container. But really, there are a lot of ways we can go with this one. We're coming at it from, obviously, a cost management perspective. And the big, bold, unpopular opinion that we're going to state is: the most expensive way to run an application in the Cloud is by treating the Cloud as just another data center. It's going to cost you way more than it would cost to run in a normal data center. And this goes back to the early days of Cloud, people just raging online and at conferences about how the Cloud is so expensive. And yes, it is so expensive—if you treat it like an antiquated data center.


Jesse: And really quick before you get your pitchforks out, there is this concept of ‘lift and shift’ that everybody likes to talk about or ‘technical transformation’ that everybody likes to talk about: moving from a data center into the Cloud, which a lot of people see as this movement where they just uproot everything from their local data center into AWS. And to be clear, we do recommend that. That is a solid strategy to get into the Cloud as fast as possible; just move those workloads over. But it is going to be expensive, and it's not what you ultimately want to stick with long term. So, that's ultimately the big thing to think about here.


Yes, lifting and shifting from your data center into the Cloud is absolutely worthwhile. But it creates this shot clock that's now running after your migration is complete, where if you don't move on to all of the services, and opportunities, and solutions that AWS provides that are native solutions, cloud-native solutions, managed solutions, you're going to end up spending a lot more money than you want.


Pete: Yeah, “The Lift and Shift Shot Clock”—that was a great blog post by Forrest from ACG, A Cloud Guru. We'll include a link to that in the [00:04:35 show notes]. It talks about how not only do you have technical debt accruing as you lift and shift, but potentially brain drain as people get sick of managing this hot mess that you've lifted and shifted over. That doesn't mean you shouldn't do it.


You absolutely should get into the Cloud, get into a single vendor with your workloads as fast as possible, so that you can then dedicate resources to refactoring all of that. Don't just forget about it and leave it behind; it's not going to end well for you. And you do have a timer; it is running. So, when you're only using those core primitives—compute, object store, block store—yeah, you're going to have a pretty fixed cost on your cloud bill.


But to Jesse's point, there are a lot of other services. Some of those require an engineering effort. Some of those just involve correctly using an instance type, or a storage class that is more specific to your access patterns. I mean, everything as basic as T-class instances—for those services that maybe don't use a lot of CPU—to reminding yourself that there are multiple tiers of S3 storage. Intelligent-Tiering will even just tier it for you.


So, if you go and store everything in standard S3 storage and use GP2 volumes on EC2, yeah, it's going to be expensive. And I know that because I look at a lot of Amazon bills, and Jesse does too, and we see the same thing. “Oh, you've got a really high bill.” “Yeah, we spend a lot on EC2.” It's like, “Oh, let me guess. A lot of I3s and C5s and M5s and a ton of EBS, right?” Amazon gives you all this optionality, and I think it's that choice which is so overwhelming for many folks moving to the Cloud. It's just, “What do I pick?” There's just so much.
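On the S3 tiering point, a lifecycle rule that shifts everything into Intelligent-Tiering is a few lines with boto3. A minimal sketch with a hypothetical bucket name:

```python
import boto3

s3 = boto3.client("s3")

# Transition all objects to Intelligent-Tiering immediately; S3 then
# moves them between access tiers automatically based on usage.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-example-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-everything",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # apply to every object
            "Transitions": [{"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}],
        }]
    },
)
```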


Jesse: So, let's talk about ephemerality, especially in the world of compute. Ephemerality really means savings in this context. Think about workloads that are intermittent or request-based: if you have peaks and valleys of demand, there are going to be times when that workload is extremely busy processing requests, and then times when there are no requests coming in, your servers are sitting idle, and you're paying for all of that compute that's not doing anything. So, if you can move toward ephemeral resources—think spot instances, or moving from EC2 to ECS or Fargate—you will end up only paying for the time that your workloads are actually running and processing requests, rather than 24/7.


Pete: Yeah, I think we need to break away from this trope of, “Well, low CPU is bad.”—


Jesse: Yeah.


Pete: —“Because anything less than a hundred percent CPU is waste.” Now, hold on a second, someone out there who runs a lot of stuff on the JVM says—


Jesse: Don't @ us, please.


Pete: Remember, you can go to lastweekinaws.com/QA and register your complaint there. So, I understand. You have to run this Java application and it is an unholy hot mess and you need to just put a whole bunch of memory in that box. It's just, “I need a lot of memory, so I need that big instance.”


Well, again, look at the CPU access patterns. That's what these T-class instances are for. That's what they're designed for: to let you have that memory you can allocate to heap for intermittent CPU workloads. Try it out. Guess what? If it doesn't work, you can always move it to another instance type. That option exists, right? [laugh].


Jesse: I think this is, again, getting to your point, Pete, that you mentioned before: there's such a wide variety of options within the realm of compute. Where do you begin?


Pete: Right.


Jesse: What do you want to start with? And most customers think, “Okay, my bare metal servers sitting in the data center had this amount of CPU and that amount of RAM, so I'm just going to spin up a bunch of servers that have the same thing.” That's not necessarily what you want. And that's not necessarily what you need.


Pete: Right. And I know, Jesse, you mentioned before all these higher-order services within Amazon. And when you look at the cost for those, oftentimes it can appear to be a lot more expensive.


Jesse: Yes.


Pete: And so you'd say to yourself, “Well, hold on, I'm moving from these EC2 instances to Dynamo. Moving from my Cassandra cluster to Dynamo, this is going to be so much more expensive.” The trick to that is, again, you have to understand your usage patterns, because especially on Dynamo, you can alternate between on-demand tables—which maybe don't cost you very much and are truly only charged for when you use them—and provisioned tables. You can auto-scale those tables up and down again as needed. And that's truly revolutionary when you've dealt with managing a Cassandra cluster on EC2.
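Both table modes are a single API call away. A minimal boto3 sketch with a hypothetical table name, showing the switch to on-demand billing and, for the provisioned case, registering write capacity for auto-scaling:

```python
import boto3

# On-demand: no capacity planning, charged only for requests you make.
dynamodb = boto3.client("dynamodb")
dynamodb.update_table(TableName="events", BillingMode="PAY_PER_REQUEST")

# Provisioned alternative: let Application Auto Scaling adjust write
# capacity between a floor and a ceiling as traffic rises and falls.
autoscaling = boto3.client("application-autoscaling")
autoscaling.register_scalable_target(
    ServiceNamespace="dynamodb",
    ResourceId="table/events",
    ScalableDimension="dynamodb:table:WriteCapacityUnits",
    MinCapacity=5,
    MaxCapacity=500,
)
```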


Jesse: My heart goes out to everybody who has actually managed Cassandra clusters.


Pete: Scars everywhere. I wake up in cold sweats some nights remembering some of those Cassandra management issues. But outside of just my mental health from managing Cassandra, there's the overhead of all those systems: all that EBS involved, network, and data transfer. It goes back to the great story of, “I can tell you're running Cassandra by looking at your data transfer bill on AWS.”


And it's the people, too. I think most companies are very bad at accounting for the opportunity cost of managing their own databases. Sure, if your business is DataStax, and you're running hosted Cassandra clusters for your clients, yeah, that's your core business model; you should be very good at that. But for most other people, maybe focus your time on your product and making it a lot better versus messing around with self-managed databases.


Jesse: This is one of the big opportunity trade-offs with AWS managed services. There are a lot of freebies, essentially, that AWS managed services provide that you wouldn't get if you were running something from scratch on on-demand EC2 instances. For example, a lot of stateful distributed services require a lot of replication to keep data up to date and keep the cluster up to date. So, let's say you've got a cluster in one region, you deploy it across a couple of AZs to keep availability high, and then you deploy it also in another region for some form of disaster recovery. Or maybe you've got an active-active application setup that requires Cassandra running in two different regions simultaneously. That's a lot of data replicating back and forth, just within one region and then across regions. Now, if you move to one of AWS's managed service solutions for the same workloads, a lot of that data transfer is free.


Pete: It’s free. It's free. It's crazy that Amazon would give anything away for free.


Jesse: Yeah.


Corey: This episode is sponsored in part by CircleCI. CircleCI is the leading platform for software innovation at scale. With intelligent automation and delivery tools, more than 25,000 engineering organizations worldwide—including most of the ones that you’ve heard of—are using CircleCI to radically reduce the time from idea to execution to—if you were Google—deprecating the entire product. Check out CircleCI and stop trying to build these things yourself from scratch, when people are solving this problem better than you are internally. I promise. To learn more, visit circleci.com.


Jesse: And this gets to the other point, too, about management overhead; all of these AWS managed services build in the cost of that management overhead. So, on its surface, these managed services are more expensive than straight up using an EC2 instance with your workloads, but when you start factoring in all of the other components, like engineering effort, data transfer, administrative overheads, these managed services start looking pretty good.


Pete: Yeah. Let's be really honest here. Open source is only free if your time is worth nothing. And many engineers place a very low value on their own time—you know, because you want to solve problems, you want to make things better, you want to fix the thing in front of you. You can't envision a world where the thing in front of you isn't there anymore.


There's always something to fix. There's always something to make better. And so it's really a statement for the flexibility that these managed services give you. I think, too, around some of these—the Dynamo example is a great one—you can dynamically modify Dynamo tables to match workload needs. Consider running Postgres on EC2, versus RDS [00:14:06 unintelligible] running it on Aurora.


The flexibility you have in dynamically adjusting the size of that engine pays for the added cost. We roughly see it's about 20 percent more expensive to run a database on RDS. That 20 percent buys me easy backups and easy multi-availability-zone deployments. Aurora has some multi-region functionality; there are just so many features I don't have to think about. So that if someone says, “Hey, can you increase the size of this instance?”—


I don't have to break into a sweat thinking, “Oh, my God, I don't want to accidentally nuke all this data.” I can update my Terraform or, God forbid, go into the Amazon Console and click-click, and just make it bigger, right? I wish I could say I can't put a price on that, but maybe the price is 20 percent more. And then I can go do something else that is far more valuable to the success of the business.
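That console click-click has an API equivalent, too. A minimal boto3 sketch with a hypothetical instance identifier; deferring the change to the maintenance window avoids an immediate restart:

```python
import boto3

rds = boto3.client("rds")

# Resize the instance in place; RDS handles the failover and plumbing.
rds.modify_db_instance(
    DBInstanceIdentifier="app-postgres",
    DBInstanceClass="db.r5.2xlarge",
    ApplyImmediately=False,  # apply during the next maintenance window
)
```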


Jesse: Absolutely. So, when we talk about moving from data centers into the Cloud, and we talk about leveraging AWS as a data center, AWS has so many amazing features and opportunities just waiting to be used to help you lower your bill. And yes, there is a little bit of lock-in, in the sense that you're now using AWS-native solutions that you can't easily move to, let's say, GCP or, God forbid, Azure. But that doesn't mean you aren't getting amazing bang for your buck, and it doesn't mean there aren't opportunities to use the same—or similar services, I should say—in different cloud providers, which is a whole topic in and of itself. But don't just use the default resources in AWS. Make sure that you migrate into AWS and then move into all of the amazing native solutions that AWS has to offer.


Pete: You know, everyone is always so scared of vendor lock-in. I feel like people have been preaching about vendor lock-in for decades now, that I've been in tech. And the reality to vendor lock-in as it relates to Amazon specifically is that sure, you could run everything on EC2, use no native services at all. But wait, didn't you use IAM for all of your authentication and access control? Whoops, didn't you use all that Terraform— which is very specific to the AWS APIs?


There is real work in actually moving off. So, then let's say you end up moving to GCP, and your entire engineering and ops team quits because they are all experts on Amazon, not GCP. They don't want to have to deal with that. Or as you move to, let's say Oracle or Azure, and then you’d just have an armed rebellion. That may be a little bit too on the nose given recent times.


But vendor lock-in is, I don't believe, as much of a thing as people make it out to be. With vendor lock-in, you're really locked into all the decisions you make in general; it doesn't mean you're locked in forever. If the business wants to change the type of database that it runs underneath the hood, it can prioritize that over, maybe, growing revenue. But I think the more conversations you have about that with actual executives at a business, they say, eh, just keep running the thing; we need to grow revenue instead.


So, at a high level, the biggest gain that you can see within a business—to move quickly, to spend the least amount of money—is to realize that there are a lot of these services that will help you increase your ephemerality: Fargate, Lambda, spot instances with ECS and with Fargate, defined-duration spot instances. If you're really scared about instances being ripped away from underneath you, but you still want to save some money, you can just define a duration and say, “I want this server for one hour.” If you're running any sort of EMR Hadoop task, that's a great way to say, “Great, I'm just going to run nonstop for an hour,” and not have to worry about the host going away.
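For the defined-duration case, here's a minimal boto3 sketch of a spot request with a one-hour block, as AWS offered at the time of this episode; the AMI and instance type are hypothetical:

```python
import boto3

ec2 = boto3.client("ec2")

# BlockDurationMinutes (multiples of 60, up to 360) keeps the instance
# running for the full block instead of it being reclaimed early.
ec2.request_spot_instances(
    InstanceCount=1,
    BlockDurationMinutes=60,
    LaunchSpecification={
        "ImageId": "ami-0123456789abcdef0",
        "InstanceType": "m5.xlarge",
    },
)
```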


So, again, there are a lot of tools out there. Don't let the initial dollar amount scare you; try to take a more holistic approach and add in the engineering time as well—and maybe what else you could be working on—when you start thinking about some of the improvements you can make in being more cloud-native, which I think is what the kids call it nowadays. Right, Jesse?


Jesse: Absolutely.


Pete: Well, if you enjoyed this podcast, please go to lastweekinaws.com/review and give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, still go to lastweekinaws.com/review, give us a five-star review, and then go ask us a question. Give us your feedback at lastweekinaws.com/QA. We will be pulling together those questions, and feedback, and hot takes, and warm takes, and even the cold takes—we're going to read all of them. And we will answer them in a future episode as we talk about more of The Unconventional Guide to Cost Management. Thank you.


Announcer: This has been a HumblePod production. Stay humble.

Fri, 29 Jan 2021 03:00:00 -0800
AWS Compensation Explained

Want to give your ears a break and read this as an article? You’re looking for this link.


Never miss an episode


Help the show


What's Corey up to?

Wed, 27 Jan 2021 03:00:00 -0800
Elasticsearching For A Business Model
AWS Morning Brief for the week of January 25, 2021 with Corey Quinn.
Mon, 25 Jan 2021 03:00:00 -0800
The Unconventional Guide to Cost Management: Architectural Context

Check out the full unconventional guide here!

Transcript

Corey: This episode is sponsored in part by LaunchDarkly. Take a look at what it takes to get your code into production. I’m going to just guess that it’s awful because it’s always awful. No one loves their deployment process. What if wanting new features didn’t require you to do a full-on code and possibly infrastructure deploy? What if you could test on a small subset of users and then roll it back immediately if results aren’t what you expect? LaunchDarkly does exactly this. To learn more, visit launchdarkly.com and tell them Corey sent you, and watch for the wince.


Pete: Hello, and welcome to AWS Morning Brief. I am Pete Cheslock.


Jesse: I'm Jesse DeRose.


Pete: This is Fridays From the Field. Triple F.


Jesse: I feel like we've really got to go full Jean-Ralphio, Parks and Rec there. “Friday From the Feeeeeeeeeeild.”


Pete: Yeah, so we're going to need to get an audio cut of that and add some techno beats to it. I think that's going to be our new intro song.


Jesse: [imitates techno beats].


Pete: Yeah, we're going to take both of those things. I'm glad we got this recorded because that's going to turn into a fantastic song. So, we're back to talk about The Unconventional Guide to Cost Management. And this is the first of a whole slew of episodes from the field about the different ways companies can impact their spend. And no, it doesn't mean go and buy the cloud management vendor of the moment to look at your spend, or fire up Cost Explorer. Those are all pieces of it, but we mean the broader things: the big levers, the small levers, the levers that don't actually go back and forth but that you turn—and you would have no idea, because it was designed by an Amazon UX engineer.


Jesse: Yeah, it's really important to call out that this discussion is looking at your cloud spend from a broader perspective, and if you didn't get a chance to listen to our episode from last week, we did a little bit of an intro framing this entire discussion. Go back and take a listen if you haven't yet. It really talks about why looking at cloud costs through these different lenses is important: why should you think about cloud costs not just from the perspective of, “Oh, I'm going to delete these EBS snapshots,” or, “I'm going to tag all my resources,” but through other lenses entirely?


Pete: Exactly. So, don't forget, you can go to lastweekinaws.com/QA and put your questions right in that box. Your name is optional. You can just leave your name blank if you don't want anyone to know who you are. Or if you want to say something really nice about me and Jesse, and you just feel a little shy—


Jesse: Aww.


Pete: —that's fine, too. But just put a question in there. And we're going to dedicate some future episodes to answering those questions and diving a little deeper for those who want to know a little bit more. But this being the first episode, we've got to talk about something. So, what are we talking about today, Jesse?


Jesse: Today we are talking about architecture and architectural context. Now, this is a really, really interesting one for me, because the first thing anybody thinks about when they think about cutting costs with their AWS spend is architecture decisions: something related to your infrastructure, whether that's tearing down a bunch of resources or deleting data that's lying around. But there's a lot more to it than that: context is everything. Knowing why your infrastructure is built the way it is, and why your application is designed the way it is, is really important to understanding your AWS cloud costs.


Pete: This is where I feel like the Cloudability, CloudHealth, CloudCheckr, Cloud-whatever companies' products, sadly, fall down. And it's similar for every recommendation engine inside of AWS: they all break down. They lack the knowledge and the context of your organization. I remember, a really long time ago, I had installed CloudHealth for the first time, and it said, “Hey, we've identified all these servers. They're sitting idle. Do you want us to turn them off for you?”


Those servers were actually my very large Elasticsearch cluster. They were idle because if no one's querying them they don't do anything, but they sure do hold a lot of data, and they really do need to be available. So, please, please don't turn those off. But that same thing could happen if you were—you know, due to risk or compliance reasons, you had to run some infrastructure as a warm standby in another availability zone or region. Yeah, sure, it's not taking requests, it’s not doing anything, but that doesn't mean that it's not supposed to be running.


Jesse: And this is really getting at one of the first big ideas, which is: work with other teams within the company. Not just other engineering teams, but product teams, possibly also security teams to understand all of the business context for your application and for your infrastructure in terms of data retention, in terms of availability, in terms of durability requirements. Because ultimately, you as a platform engineer, or an SRE, or a DevOps engineer, or whatever the hot new title is going to be a year from now, you need to understand why the infrastructure exists, and you may see servers that are sitting around idly doing nothing, but that's your disaster recovery site that is required by the business, by a service level agreement to be available at a moment's notice if something goes wrong. And so it's really important to understand what those components are and how they work together to build your overall application infrastructure.


Pete: Yeah, that's a great point. I mean, if you've been at a company for years, you've got a lot of this historical knowledge. People have come and gone; they've done things, implemented items, brought new features, and left. As companies grow, there may not be a single person who truly understands the impact of various changes. I think we saw that most clearly when Amazon had their Kinesis outage: the number of different services that were impacted was pretty large, because it's just all too big for any one person to understand.


But that doesn't mean that you shouldn't continually be working to understand those different usage requirements, and chatting with the non-tech teams. Product teams, I feel like, are often ignored in startups because you don't really want more work, and that's what product teams normally bring, right? But they're going to have a lot of context.


I remember working in SaaS companies and looking at things like, “This? We don't use this anymore. There's no way we use this. I'm going to turn this off.” And then smarter minds prevail, and I say, “Well, let me go talk to the product people.” And they go, “Oh yeah. We can't get rid of that one super important API, because this one client of ours paid us an obscene amount of money to make sure that we always support it.” It's like, wow, dodged a bullet on that one, right?


Jesse: Yeah. And I feel like this gets at another important idea, which is the idea of communication and removing tribal knowledge, removing information silos. Really make sure that this information is communicated to everybody, whether that is in written documentation, whether that is in verbal communication— actually, it should probably be in both, ideally— to make sure that everybody understands why your architecture is the way it is so that they have the context to know that that server that's sitting idle is worth keeping around, or that API that never gets any requests is kind of important, and you can't just get rid of it.


Pete: Yeah. Now that, in the cloud world, you pay money every time you do every little thing—transfer data, provision servers, make I/Os—overlay those types of costs onto your architecture diagrams. You know, those architecture diagrams that you were told to create and keep up to date five years ago?


Jesse: We absolutely still have those.


Pete: Yeah, we totally have. They're super accurate. But make those architecture diagrams work more for you by putting more information in them. Don't just draw little lines that connect every server to every other server; that's not helpful to people. Maybe map out the data flows, and the volume of data flowing between different applications or their consumers. And then, for the true next-level cost management experience in your organization—


Jesse: [singing] Ahh.


Pete: —put some dollar amounts on them. And if they're close to accurate, that would be neat, too. They don't even need to be a number; they could be a percentage you could always go look up later. With that information, you now have an architecture diagram that can be used by a whole slew of people. It can be used by finance so that they can look at it and say, “What's the most expensive part of our infrastructure?” It can be used by product teams to understand how things interact with each other, and by engineering teams to debug when things go wrong—versus just boxes with lines.


Jesse: Yeah, I think this also gets at a really interesting point, looking at all of these components of your application architecture and your infrastructure architecture in diagrams, that will also let you understand where your data is going, how much of your data is moving from place to place. And we'll talk about this later in a different episode, but this gives you the opportunity to really better understand the data flow from resource to resource, from microservice to microservice, and understand how much data you're actually moving around and why that data is moving, or where that data is moving.


Corey: This episode is sponsored in part by CircleCI. CircleCI is the leading platform for software innovation at scale. With intelligent automation and delivery tools, more than 25,000 engineering organizations worldwide—including most of the ones that you’ve heard of—are using CircleCI to radically reduce the time from idea to execution to—if you were Google—deprecating the entire product. Check out CircleCI and stop trying to build these things yourself from scratch, when people are solving this problem better than you are internally. I promise. To learn more, visit circleci.com.


Pete: Yeah, to put this into a bit more perspective, we do a lot of cost optimization projects with our clients where we really dive, technically, into the architecture and chat with their engineers to understand how and why they chose certain services. And there was one client of ours who had a lot of spend hidden inside the movement and duplication of data: reloading Aurora RDS instances from nightly data dumps in S3, and pushing this data through multiple Kinesis streams to be consumed by various downstream applications. And it was not the engine that cost them the most money; it wasn't the fact that they ran a certain size of Aurora, and it wasn't even the data storage costs. It was all of the I/O, all of the data movement, and the network charges that were really starting to add up. They didn't have any data flow diagrams showing data moving between these different places.


If they did, they would have clearly seen all of this duplication that was going on and been able to resolve it in a more timely fashion. But they're no different from a lot of other clients out there: they don't have a central place to put this information. In their example, they could move towards more of a data lake model versus continually dumping into and reloading databases; and even in a Kinesis world, you can have fan-out streams so that you don't have multiple duplicate streams copying data from one place to another. But those data flows can expose a lot of spend that might really be hidden. I always like to talk about how moving data is so expensive—and we'll actually dedicate a whole episode just to that concept—but that's how you identify why those spend items are so large in your organization: that why; that context.


Jesse: Absolutely. And that is why context is so critical when building your application architecture diagrams and your infrastructure diagrams. You really want to make sure that you understand how your portion of the application works, but also how all the other portions of the application works so that, ultimately, everybody can be on the same page together, and everybody can help each other accomplish those goals and really help the business move forward together.


Pete: Yeah. And it doesn't even stop at just accepting what you find. In asking these questions and getting this information, you're asking yourself: do we really need this cluster to be replicated across four or five availability zones? Would our risk tolerance allow for three? Would it allow for two?


I mean, I saved a bunch of money in the past. I used to replicate an Elasticsearch cluster across three AZs. That was just our standard; everything we deployed went across three AZs. And then we started looking at Elasticsearch and going, “Well, we're only running a primary and a secondary replica shard. So, there's only two shards—why not just have one in one AZ and one in another? What's the point of the three AZs?”


And we were able to cut out a large amount of spend by better understanding and asking those questions. Asking why; getting the context. Because that's the thing, too: there were a lot of decisions—especially at high-growth startups that I've been at—where the decision was a great idea four years ago, and it was critical to the success that let the company survive long enough to even ask these questions now. So, just because someone wrote it down four years ago and that's the way it was done then, you might have to reassess it. I think, too, when it comes to things like multi-region or multi-datacenter strategies, sadly, a lot of folks haven't really caught up on Cloud yet, and so they might say, “Yeah, you have to be multi-region.”


It's like, well, what's the actual requirement? Does it require separate network endpoints and separate physical availability zones? You know, the availability zones are separate disaster domains. I mean, there are nuances around regional power and network and things like that, but in a lot of cases, that requirement far exceeds these legacy multi-datacenter worlds that used to exist. So, asking those questions and getting that context helps really drill down to the crux of the spend. And then you can really present its impact to the business and move it, hopefully, in a better direction.


Well, if you have enjoyed this podcast, please go to lastweekinaws.com/review, give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review and give it a five-star rating, and then right after, go to lastweekinaws.com/QA and give us a question. We look forward to answering that in a future episode. Thanks.


Announcer: This has been a HumblePod production. Stay humble.

Fri, 22 Jan 2021 03:00:00 -0800
The Various Billing Philosophies of AWS

Want to give your ears a break and read this as an article? You’re looking for this link.


Sponsors



Never miss an episode



Help the show



What's Corey up to?

Wed, 20 Jan 2021 03:00:00 -0800
Replicating DynamoDB the Dumb Way
AWS Morning Brief for the week of January 18th, 2021 with Corey Quinn.
Mon, 18 Jan 2021 03:00:00 -0800
Introducing From the Field: The Unconventional Guide to Cost Management

About Corey Quinn

Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.


Transcript

Corey: When you think about feature flags—and you should—you should also be thinking of LaunchDarkly. LaunchDarkly is a feature management platform that lets all your teams safely deliver and control software through feature flags. By separating code deployments from feature releases at massive scale—and small scale, too—LaunchDarkly enables you to innovate faster, increase developer happiness—which is more important than you’d think—and drive transformation throughout your organization. LaunchDarkly enables teams to modernize faster. Awesome companies have used them, large, small, and everything in between. Take a look at launchdarkly.com, and tell them that I sent you. My thanks again for their sponsorship of this episode.


Pete: Hello, and welcome to the AWS Morning Brief: Friday From the Field. Triple F; that's what we're calling it now. We’re going a new direction. I'm Pete Cheslock.


Jesse: I'm Jesse DeRose, and I'm so excited for Triple F.


Pete: Triple F. Hashtag Triple F. So, taking this in a new direction: we have… not stolen—that's a little bit too aggressive—but we have been lovingly gifted this podcast from Corey Quinn. After taking over while he was on paternity leave, we just kept on doing it; we never stopped, we never let him have it back. And he was nice enough to give us this opportunity to take this Friday podcast in a new direction and talk about things that we're seeing as cloud economists in the field, working with our clients.


Jesse: Yeah, it really started as this confessional discussion of weird architecture patterns that we've seen, but then it definitely morphed into more of the other things that we've seen from either our work with Duckbill or work with previous engagements or previous companies. So, it just felt fitting to rebrand just ever so slightly and focus more of our efforts on what are the things that we're seeing day-to-day? What are the major problems that our clients are seeing? What are some of the pain points we've seen? What are the new features from AWS that are really the interesting and important things to talk about?


Pete: Exactly. We have an interesting insight that I think a lot of folks in the industry don't get: we, for one, look at countless Amazon bills, seeing how people are spending their money. But engineering teams also often reach out to us directly for help answering the questions they're getting from finance. I mean, that's the biggest fear I have—


Jesse: Yeah.


Pete: —CFO comes walking over to my desk, and I haven't submitted an expense report recently like, what do they want?


Jesse: [laugh]. I didn't do it. It wasn't me.


Pete: Even worse is when some of your executives start learning some of these terms. And they say, “Hey, what's our cost per unit on Amazon Cloud?”


Jesse: Yeah, it is something that has morphed from just a conversation about engineering teams thinking about their architecture patterns and what might be best for them to getting the entire company involved—especially finance—to ask all these questions and really think about, what's the bottom line here? How can we better understand this cloud spend?


Pete: I know most people are probably thinking, “Doesn't tagging solve this problem? Can't I just tag everything, and then I have all my answers, right?” Problem solved.


Jesse: I'm sorry, did you just tell me to go F myself there, Pete?


Pete: [laugh]. Obviously, we both know that even the best of companies—the most mature companies we work with—might be about 90% fully tagged, but even those companies still have to put in a lot of effort to answer these questions and to understand where their spend is going. They say that which gets measured gets improved. So, are you measuring your spend? Are you measuring your growth? Do you understand how your spend changes as usage changes and your customers change? I mean, there are countless questions. But there's another thing that we see, too, Jesse, right? This circle of pain—what is it—the cost management circle of pain.


Jesse: Yeah. Yeah. It's this really fascinating idea focusing on cloud cost optimization, where a company will realize that their cloud spend has gone up for whatever reasons, and they say, “Oh, no. We need to do something about this.” Whether that is because finance has come over and asked the question, or because engineering has caught the issue.


And so they go through this quick session, maybe a quarter, maybe a couple months or more of figuring out, “How can we cut costs? Can we remove resources? Can we put these practices into place? Can we build some processes? Okay, now, everything's fine, right? We've managed to bring our costs back down. We managed to get rid of all of those EBS snapshots that were collecting dust and never to be used, so now we can go about business as usual again, right?”


And so then they continue on as if nothing has happened. And without making long term changes, those costs are going to rise again. And then all of a sudden, we're back in the same spot of, “Oh, no, our cloud costs have gone up, why did they go up? We did all these things to make sure that we didn't have run into this issue again. Why are our cloud costs going up again?” And the cycle just repeats. It's a really unfortunate kind of spiral.


Pete: I remember my time at a startup where we were under a period of really high growth, a lot of customers coming onto the platform. And my favorite meeting ever was the CEO talking about our financials. He mentioned that our gross margin was negative 175%—which, for the non-financial folks, means roughly $2.75 going out the door for every dollar of revenue coming in. You normally want that number to be positive if you want to have a successful business. And I remember the line he said: “We are going to successfully go out of business with a gross margin that is negative one hundred and seventy”—whatever I said.


This is an important number that people need to think about. And what's amazing is that within a year, we had turned that around to be an extremely high gross margin because we started looking, and tracking, and bringing cultural change, and giving ownership to people to own these numbers. So, it's not just an engineering problem anymore. Everyone thinks that the Amazon bill is because your engineers built a certain thing, or turned on a certain type of instance. And sure, part of that is absolutely true, but I always like to say that your Amazon bill is the sum total of all of the decisions the business has made.


The business chooses what things to do and in what order—prioritizing revenue over technical debt, say. And all those decisions will impact the bill. It just so happens that it impacts the bill in a way that's so much more visible than in the data center world, where you just bought all this stuff and let it sit there.


Jesse: Yeah, I think it's really fascinating because you ultimately end up with visibility into all of these other parts of the business that you may not have known about or thought about as clearly. Because if I'm looking at a massive spike in S3 spend, maybe that's because security has said we need to keep a certain amount of records for a certain amount of time for audit purposes. Well, as an engineer, that's not something that I necessarily focus on day-to-day or care about day-to-day. But from a security perspective, that's a majorly important business decision. But ultimately, it ends up impacting the engineering teams because it's their bottom line that's being spent.


Pete: Exactly. So, we're going to be taking the next many weeks to go through what we're calling The Unconventional Guide to Cost Management with a variety of different things that we see in the field that the most mature organizations are focusing on, and why they're focusing on them, and some actionable ways to go about this improvement in your company and their cost management strategy. And you're going to think to yourself, “Wow, that sounds really boring.” It's like, well, at some point, that CFO is going to stroll by your desk and want to know what's going on. Or you're going to be instructed to build a chargeback plan or a showback plan, and if you're not ready for that, it could be a little bit of stress to your day-to-day.


Jesse: Yeah, I mean, just as we highlighted, there are so many different drivers of cloud costs. But we think the main driver is your architecture decisions, and the context behind those decisions. If you think about a highly regulated industry versus one that's not as highly regulated, for example, you see a lot more data that needs to be kept for audit purposes, like I mentioned before. These are little examples of architecture decisions, made as your business grows, that have a really unique impact on your cloud costs.


Pete: Yeah, I feel like this is where a lot of these automated cloud management tools really fall down: they lack that context. They don't understand that those systems in us-west-2 are actually my DR site that I need to have running at all times because my audit and risk team has told me I need to do that. While the CPU is not being used, and I would love to shut them down, having them off until needed does not actually meet my risk requirements. That context is—


Jesse: Absolutely.


Pete: —what is so important here. So, this Unconventional Guide falls under what we have identified within Duckbill Group, from working with all these clients, as four main capabilities. These are levers that help influence cost within your organization. And those four capabilities are architect, attribute, invest, and predict.


So, let's kick it off with architect. What does this mean? How you architect your applications—what Jesse just said: are you using higher-order services within Amazon? Specifically, are you using Lambda? Lambda increases the ephemerality of your systems; the less they run when they're not doing anything, the cheaper your bill will be.


If you have T-class instances, maybe you need a lot of memory allocated but your CPU usage is very intermittent. How you've architected your application—whether it was specifically designed for the Cloud or just evolved by accident over time—and the requirements that sit under that architecture are one of the main drivers of cloud cost. But attribute. What about attribute? What does that mean, Jesse?


Jesse: Yeah. I think it's important to call out that when we talk about these capabilities, architect is definitely one of the most broadly used and referenced capabilities that we see in terms of talking about cloud architecture and architecture decisions. But there are these other three capabilities that are important to highlight because they do also impact cloud costs. So, the attribute capability focuses on attributing cloud costs within your organization, whether that is to specific teams, maybe specific product lines, maybe specific business units. It really depends on how your organization structures itself.


But you want to be able to provide your cloud costs bottom line to each of these teams to say, “Okay, this team is spending this much money per month to run their application or their microservice.” When your organization can see where your AWS costs are going along business lines, you understand the context for that cost. You move away from a vague pain around your bill to making informed decisions about engineering investments. So, ultimately, you can build showback models or chargeback models so that you understand how much each team or each business unit is spending on the Cloud. And then ultimately that gets into your unit economics as well, which we'll talk about in a second. But accurate cost attribution really helps everybody in your business understand the costs of your business decisions.


Corey: This episode is sponsored in part by CircleCI. CircleCI is the leading platform for software innovation at scale. With intelligent automation and delivery tools, more than 25,000 engineering organizations worldwide—including most of the ones that you’ve heard of—are using CircleCI to radically reduce the time from idea to execution to—if you were Google—deprecating the entire product. Check out CircleCI and stop trying to build these things yourself from scratch, when people are solving this problem better than you are internally. I promise. To learn more, visit circleci.com.


Pete: I think that's an important one, too. Working at SaaS businesses, usually the SaaS spend is going to grow as the customer count grows. But one thing I find a lot of product teams fall down on is that they don't accurately understand the cost of product decisions, and product decisions are often going to be a big driver of your spend. If a product team has a requirement to keep data for long periods of time but isn't going to charge customers more for that, you've got a big breakdown there. And so, I've always had a lot of success attributing costs in my applications at the product level and bringing that as ammunition to product meetings, saying, “Hey, I'm looking at your product backlog, and these are the four projects we're working on.


But those four projects represent 10% of our spend, and your fifth project represents 80% of our spend”—or something crazy like that—“Maybe we want to readjust this.” Maybe not. Maybe the business doesn't need to do that or want to do that. But those are all questions that, if you don't have that information, you can't ask. Another of the big levers we mentioned earlier is investing.


Making the biggest commitments that you can to Amazon—really to any cloud vendor, but specifically Amazon—will reduce the cost of running on Amazon. An upfront savings plan is a really simple way of making a commitment to Amazon to reduce your spend. And the longer the commitment you can make, the better, if you have confidence that you're probably not going to be leaving Amazon anytime soon unless you're forced to—like they just turn you off—speaking for no company in particular. If you're going to be on Amazon for three years, you're probably going to be on Amazon for five years, and if the company can make a five-year commitment over a three-year one, you will save more money as part of an enterprise discount program, things like that. So, the contracts that you enter into: the longer that you can extend these, and the larger that you can make them without overextending, the better off you're going to be. Those investments are big ways to move those levers. But they all lead to this final lever, which is predicting your spend.
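
To make the shape of that lever concrete, here's a toy illustration in Python. The discount percentages are made-up assumptions; actual savings plan and EDP rates vary by service, term, and negotiation.

# Illustrative only: how commitment length changes effective spend.
# Every discount percentage below is a made-up assumption; real
# savings plan and EDP rates vary by service, term, and negotiation.

on_demand_monthly = 100_000  # hypothetical baseline monthly spend

discounts = {
    "on-demand (no commitment)": 0.00,
    "1-year savings plan":       0.25,
    "3-year savings plan":       0.45,
    "3-year plan + EDP tier":    0.50,
}

for label, discount in discounts.items():
    effective = on_demand_monthly * (1 - discount)
    print(f"{label:28s} ${effective:>9,.0f}/month")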


Jesse: Yeah, and I think it's also important to call out that when you think about investing in AWS, it's not just about putting money down to get a discount; it's about investing in your relationship with your AWS account team. Your account manager, your technical account manager, and whoever else you work with from your account team want you to be happy with AWS; obviously, they want you to continue using AWS. So, building a good rapport with your AWS account team will take you far. It can lead to a lot of really great conversations around best practices and architecture, but it could also potentially lead to your AWS account team going above and beyond to help you get that extra percent discount when you are renegotiating your EDP or your private pricing addendums.


Pete: Yeah, that's actually a really great point: the better the relationship you have, the more potential savings or overall improvements can be identified. That account management relationship is your one way into the various AWS service teams. So, the better that relationship looks, the more positive it will be for the business.


Jesse: Yeah. So, last but not least, we have predict. Predict is all about predicting your future spend. It's all about forecasting future spend. And this is the one that makes all of the finance team just drool.


If you can share prediction models with your finance team predicting your engineering spend, or your cloud spend, out six months to a year in advance, they are going to love you forever and ever. It is absolutely worth your time to work on building these models. But in order to build those models, you need to understand what your spend actually looks like right now. What does your spend look like per business unit or per team, like we talked about before? Which is where the attribute capability comes into play, because you need to understand which teams or which business units are spending what on the cloud first, before you can accurately predict what those teams are going to spend in the future.


And most importantly, this gets to a conversation about unit economics, which we will dive into in more detail in a later episode. It allows your business to understand how much it actually costs per user of your application, for example, and how much you can competitively charge a company or a user so that they get a decent price for your service while you are still able to make a profit.
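
As a sketch of how simple such a model can start out, here's a deliberately naive projection in Python. The monthly figures and the user count are made up, and a real model should account for seasonality, launches, and committed-use discounts.

# A deliberately naive forecasting sketch: project spend forward
# using the average month-over-month growth rate. All figures here
# are hypothetical.

monthly_spend = [82_000, 85_500, 88_900, 93_200, 96_800, 101_300]

growth_rates = [
    later / earlier
    for earlier, later in zip(monthly_spend, monthly_spend[1:])
]
avg_growth = sum(growth_rates) / len(growth_rates)

forecast = monthly_spend[-1]
for month in range(1, 7):  # project six months out
    forecast *= avg_growth
    print(f"Month +{month}: ${forecast:,.0f}")

# Tie the forecast to unit economics: an assumed active-user count
# turns total spend into a cost per user you can price against.
active_users = 120_000
print(f"Cost per user at month +6: ${forecast / active_users:.2f}")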


Pete: Yeah, if you want to be the hero of your sales team, and probably your CEO as well, this ability to forecast helps when your sales team is out there trying to make a really aggressive offer to bring in a new customer. If they have confidence in their ability to discount, they can make a really competitive offer to try to bring in new business. And that confidence comes from understanding your cost per unit: your cost to deliver the service to that specific customer. They can go in and offer the best possible discount without risk to the business, which then makes, again, all the executives and the board happy. So, that is the superpower [laugh] of cost management.


So, those four main capabilities that we talked about: architect, attribute, invest, and predict. We're going to tie into those over the next many weeks as we talk about this Unconventional Guide: all these different tips that can help you impact your spend, hopefully in a positive way, within the business. So, we're going to be diving into these. We want to answer your questions along the way, and we want to take the time to break out and answer some of those questions, diving deep for you as you learn these concepts and apply them to the complexity within your own businesses.


Again, you can always go to lastweekinaws.com/QA to ask us a question. Feel free to add your name if you like, but you can be totally anonymous, too. If there's something that we're talking about that you want us to clarify further, we're going to break up these sections by doing some listener Q&A, diving into those questions and explaining some common practices that we've seen that hopefully will help guide you as you work on this process.


So, again, if you have any questions, hit us up: lastweekinaws.com/QA. We will collect those and dive into them, and we've already been getting some really great questions that we're looking forward to dedicating some future episodes to. So, this is AMB Friday From the Field. I really appreciate you taking the time.


If you have enjoyed this podcast, please go to lastweekinaws.com/review, give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review, give it a five-star rating on your personal podcast platform of choice, and then go to lastweekinaws.com/QA and tell us what you hated about it, or just give us a question. We'd love to read it. Thanks so much.


Announcer: This has been a HumblePod production. Stay humble.

Fri, 15 Jan 2021 03:00:00 -0800
Parler’s New Serverless Architecture

Special thanks to Alice Goldfuss for this week’s awesome title!

Want to give your ears a break and read this as an article? You’re looking for this link.

Sponsors



Never miss an episode



Help the show

What's Corey up to?

Wed, 13 Jan 2021 03:00:00 -0800
Insurrection Week
AWS Morning Brief for the week of January 11, 2021 with Corey Quinn.
Mon, 11 Jan 2021 03:00:00 -0800
Kubernetes is the Most Expensive Way to Run a Service

Transcript
Corey: Software powers the world. LaunchDarkly is a feature management platform that empowers all teams to safely deliver and control software through feature flags. By separating code deployments from feature releases at scale, LaunchDarkly enables you to innovate faster, increase developer happiness, and drive DevOps transformation. To stay competitive, teams must adopt modern software engineering practices. LaunchDarkly enables teams to modernize faster. Intuit, GoPro, IBM, Atlassian, and thousands of other organizations rely on LaunchDarkly to pursue modern development and continuously deliver value. Visit us at launchdarkly.com to learn more.


Pete: Hello, and welcome to the AWS Morning Brief. I’m Pete Cheslock.


Jesse: I'm Jesse DeRose.


Pete: And we're back yet again. We're well into 2021. I mean, about a week or so, right?


Jesse: I'm excited. I'm just glad that when midnight struck, I didn't roll back over into January 1st of 2020.


Pete: Yeah, luckily, it's not a Y2K scenario. I don't think we have to deal with the whole date issue until, what, 2038, whenever the next big Y2K-ish date problem is going to be. I'm hopefully retired by the time that happens.


Jesse: That's future us problem.


Pete: Yeah. Future us problem, absolutely. Well, we've made it. We've made it to 2021, which is a statement no one thought they were going to say last year at this point.


Jesse: [laugh].


Pete: But here we are. And today, we're talking about an interesting topic that may bring us some hate mail. I don't know. You tell me, folks that are listening. But we're seeing this more and more in our capacity as cloud economists working with clients here at The Duckbill Group: folks who are running Kubernetes—whether it's EKS, or they're running it on EC2 using maybe, like, OpenShift—are actually spending more than people who are using other primitives within AWS.


So, we wanted to chat a little bit about why we think that is and some of the challenges that we're seeing out there. And we would love to hear from you on this one. If you are using Kubernetes in any of the ways that we're going to talk about, you can send us a story about how you're doing that, maybe answer some of these questions we have, or explain how you're using it. Go to lastweekinaws.com/QA—that's QA as in questions, not quality assurance—to ask us questions. You can put in your information and add your name if you want; it's optional. You can be completely anonymous and just tell us how much you enjoy our wonderful tones and talking about technology. So, Kubernetes. Why is this the thing, Jesse?


Jesse: I feel like when it first came out, it was the hot thing. Like, everybody wanted Kubernetes, everybody wanted to be Kubernetes, there were classes on Kubernetes, there were books on—like, I feel like that's still happening. I think it has amazing potential in a lot of ways, but I also feel like… in the same way that you might read the Google SRE book and then immediately turn to your startup team of three people and say, “We're going to do everything the way that Google does it,” this isn't always the right option.


Pete: I feel like the Google SRE book is like The Mythical Man-Month: it's the book that everyone wants to quote by name, but none of those people have ever actually read it.


Jesse: Yeah, there's lots of really great ideas, but just because they're great ideas that worked well for a large company at scale doesn't necessarily mean that they're going to be the right ideas for your company.


Pete: And also, we're both fairly grizzled former system administrators and operators; Kubernetes is not the first, kind of, swing of the bat at this problem. I mean, we've had Mesos, which is still around but not as hip and cool; we've had OpenStack. Remember when all the Kubernetes people were saying, “Nope, OpenStack is going to be the greatest thing ever”? So, needless to say, we are a little jaded on the topic.


Jesse: You can't forget about Nomad, either, from HashiCorp, built cleanly into HashiCorp’s Hashi stack with all of their other amazing development and deployment tools.


Pete: Yeah. I mean, this is a problem that people want to solve. But with the rise of cloud on Amazon, I always struggled with why it was needed. And we're going to talk a little bit about that.


So, again, what is Kubernetes? I hope people that are listening know this, but maybe not. It's an abstraction layer for scheduling workloads. It's the solution to the Docker problem: a container is great. I have a container; it is a totally self-contained application, ready to go with my configuration and my dependencies. And now I need a place to run it. Well, where do I run this container? Pre-Kubernetes, Jesse, you'd probably use something like ECS—the Elastic Container Service—as a way to schedule some workloads.


Jesse: Or maybe if you just wanted to run a single virtual machine somewhere and run that container in the virtual machine, you might do that as well.


Pete: Yeah, that was how a lot of the earliest users of Docker were just running it: they were running the containers as applications—because that's what they are—on bare EC2. They would just spin up some EC2 and run a Docker container on there. And there were benefits to that. You got this isolated package deployed out there without having to worry about dependencies. You didn't have to worry about having the right Python dependencies or Ruby dependencies.


It came with everything it needed, and that solved a big problem. Now, Kubernetes, I think, brings this really interesting concept that I like: it's an API that theoretically you could use in a lot of different places. If you now have this API to deploy your application anywhere there's a Kubernetes cluster, does this solve vendor lock-in? Could you use Kubernetes to solve some of these issues that we see?


Jesse: You could use Kubernetes to solve vendor lock-in in the same way that you could use multi-cloud to solve vendor lock-in. Again, it is a solution to the problem, but is it the right solution for your company?


Pete: That is always the question I would ask folks when they were using Kubernetes: why are you using it? I honestly will say I never got—I don't want to say never; that's not fair—I rarely got a good answer. It was often a little bit of operational FOMO—you know, the fear of missing out on the next hottest thing, which, of course, is never a good way to pick your architecture stack. Now, that being said, at a previous company, we were investigating Kubernetes to solve a problem with our stateless applications—because I in no way trusted it to run anything stateful.


I didn't want any of my databases on it. But it is a great way to put more control into my developers’ hands for deploying their applications. We ran predominantly C class instances on EC2, running a CPU-heavy data processing application, and so it seemed to make a lot of sense to get more efficient bin packing and to let the developers be a little bit in control of how much memory and CPU they were going to allocate. But at the end of the day, we never ended up going down that path because, again, for us and our architecture, just continually running correctly sized EC2 instances with some of the other abstractions that we had made the most sense.


But if I was in a data center, if I had legacy data center hardware, I think Kubernetes, that’s, like, the dream API for me. It's been years since I've been in a data center, but having that way of creating an API for my physical data center assets that could be a similar API—we're not going to say the same because we know that's not true—but a similar API in a cloud vendor, like, that can be pretty compelling, right?


Jesse: Yeah, it really does get to that idea of abstracting away the compute layer from the developer so the developers don't have to worry about, “What kind of infrastructure do I need? What kind of resources do I need? Do I need to provision this C class instance on-demand? Do I need to provision an M class instance on-demand? Do I need to provision this other thing on-demand? And what does all of that look like?” Ultimately, you can give the developers the opportunity to say, “I know that I need this many vCPUs and this much memory. I don't care where it runs past that. Go.” And let them, again, focus on their development cycles more than on the infrastructure component of the application. I think it's a great opportunity.
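
As a sketch of what that contract looks like in practice, here's the request expressed through the official Kubernetes Python client; the container name, image, and sizes are all placeholders.

# "Tell me how much CPU and memory you need; don't worry about nodes."
# A sketch using the official Kubernetes Python client; names, image,
# and sizes are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside a pod

container = client.V1Container(
    name="billing-worker",
    image="example.com/billing-worker:1.0",  # hypothetical image
    resources=client.V1ResourceRequirements(
        requests={"cpu": "500m", "memory": "512Mi"},  # what I need
        limits={"cpu": "1", "memory": "1Gi"},         # what I may not exceed
    ),
)

pod_spec = client.V1PodSpec(containers=[container])
# The scheduler, not the developer, decides which node this lands on;
# that bin-packing decision is exactly the abstraction being bought.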


Pete: Yeah, absolutely. Anytime you can have an engineer just ship their application without having to worry about what's behind it, that's a win. That's why services like Lambda and Fargate, where you just push the thing to the cloud and let it run, are really effective ways of unblocking those developers so they don't have to deal with any sort of, kind of, operational thoughts, [laugh] for lack of a better term. They can just push those applications. But let's dive in a little bit Amazon-specific, right, because we were talking about how Kubernetes is the most expensive way to run a service.


But I would say: on Amazon. That's the catch here because there are so many ways of running things on Amazon. One thing that we see often, both internally and with our clients, is that the more ephemeral you can make an application, the cheaper it will be to run. So, not to put you on the spot, Jesse, but what does that really mean when we say that ephemerality is a driver of cost?


Jesse: Yeah. Ultimately, we see a lot of clients who treat AWS like another data center. They treat AWS the same way that they would treat a bare-metal physical server sitting in a data center somewhere, which, to some extent, arguably, it is, but when you think about development and resource usage, AWS provides so much more than a data center in terms of cloud-native resources. And when I'm talking about cloud-native resources, I specifically mean things like ECS, Fargate, and Lambda, where developers can quickly deploy and iterate on an application without relying on a massive compute infrastructure underneath it. So, ultimately, a company is able to deploy things in such a way that code is only around for as long as it needs to be around.


This is moving towards the idea of request-based workloads, especially with Lambda, where code fires whenever a request comes in, does whatever it needs to do for that workload, and then it's done. That's it, and you only pay for the amount of time that that request took. Whereas when you are running workloads on an EC2 on-demand instance, you're paying for the entire amount of time that that instance is online and running, whether the application is actually serving any requests or not.


Pete: Yeah. I mean, it seems like common knowledge nowadays, but I'm still going to repeat it: with the exception of a T class instance, if your CPU usage is anything below one hundred percent, there is technically waste, right? There is capacity you're not using. Now, obviously, it gets a little more complex: maybe you're using all of the memory because you've got some gnarly Java application and very little CPU, but in that case I'd say, yeah, you're still probably using the wrong instance. Move to a T class instance.


But again, your mileage may vary. There's a reason there are hundreds of different EC2 instance types: there are so many different workloads. But yeah, that goes back to that statement, Jesse, like you said: the more ephemeral your applications—the more they can survive any sort of workload interruption, leveraging Spot and things like that—the less expensive they are. The Kubernetes story is interesting because we've seen this quite a few times, especially across this past year: when it comes to EKS, you're now abstracting away the instances underneath there.


Well, I think one question I always ask is, “Well, how many clusters are you running?” All of your applications don't currently run on a single instance type. Do you run one cluster with a series of instance types, and start scheduling people to instance types based on their workloads? What about your stateful applications? Have you brought those over yet?


I have a hilarious story from a friend of mine who was doing a very large-scale Kubernetes deployment where one hundred percent of applications must go to Kubernetes. That was the initiative. And as it turns out, developers, they don't know what to put in the YAML. They don't know all the things they need to fill out. And so, they would deploy their database cluster with their application, and then it would get rescheduled somewhere else, and they'd be like, “Hey, where's my data?”


Jesse: Oh, no.


Pete: Well, they forgot to make the disks persistent. They forgot to do that setting. But, Jesse, I think, to your point, you had mentioned before all these different services that give you kind of a Kubernetes-like experience. The one place where they all fall down is the simplicity of a YAML file: you define it, you ship it off, and your magic happens. I think the thing that has made Kubernetes win is, “I just want to do this YAML and make it work,” versus, “How do I deploy a Lambda function in 2021?”


Jesse: Yeah.


Pete: Crickets. I mean, you'll either have like, “I don't even know where to start. Do I need Terraform? Do I need CloudFormation? Can I use the Amazon CLI? Can I click around through the UI? Am I really going to let my 1000 developers do that?” Way too many question marks versus, well, I have a YAML file that I can ship via my CI system. That is pretty compelling.


Jesse: I think it's also worth calling out really briefly when we're talking about workload types versus stateful applications. We've seen both stateless and stateful applications run on Kubernetes or run on containers, and we've seen issues like Pete mentioned with his friend’s story of data gone missing because an EBS volume didn't persist, but I just want to highlight, really quickly, please don't run stateful applications on your Kubernetes infrastructure or on your container infrastructure to begin with. I know that there's a lot of benefits to that on the front end; there's a lot of shiny red bows, but there's so many other things to think about that aren't highlighted up front, that are kind of hidden behind the scenes, which I know we'll get to, but it is rarely the right solution for stateful workloads.


Pete: You know, just a reminder to the folks listening, you can go to lastweekinaws.com/QA and give Jesse your feedback. [laugh].


Jesse: Any time. [laugh]. I can't wait to see all the hot takes on that one.


Pete: Yeah. I’m going to agree with you on that one. I don't know if I would personally trust a stateful workload in Kubernetes versus somewhere else. Again, speaking Amazon-specific: if I'm going to run a SQL cluster, it's going into RDS; that's going to be my abstraction layer. Elasticsearch? I've had way too many years of experience with Elasticsearch; I feel like I would have more control running it on my own EC2. And honestly, if I had very little experience with Elasticsearch, I would use the Amazon Elasticsearch Service, right?


Corey: If you're like me, one of your favorite hobbies is screwing up CI/CD. Consider instead looking at CircleCI. Designed for modern software teams, CircleCI’s continuous integration and delivery platform helps developers push code with undeserved confidence. Companies of all shapes and sizes use CircleCI to take their software from bad idea to worse delivery, but to do so quickly, safely, and at scale. Visit circle.ci/screaming to learn why high performing DevOps teams use CircleCI to automate and accelerate their CI/CD pipelines. Alternately, the best advertisement I can think of for CircleCI is to try to string together AWS’s code, build, deploy pipeline suite of services, but trust me: circle.ci/screaming is going to be a heck of a lot less painful, and it's where you're ultimately going to end up anyway. Thanks, again to CircleCI for their support of this ridiculous podcast.


Pete: So, I feel like there are only rare scenarios where, again, on Amazon specifically, you would want to run a stateful workload this way. The biggest thing that we have seen is that because most people are deploying these Kubernetes clusters across availability zones, they're incurring significantly greater data transfer charges. Amazon charges for every type of data transfer there is, with very few exceptions. And cross-availability-zone data transfer continues to be one of the places our customers spend the most money on data transfer. Oftentimes, they don't actually know what is driving most of that spend.
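
One hedged way to see those cross-AZ line items today is to group Cost Explorer results by usage type and keep the “Regional-Bytes” entries, which is how inter-AZ transfer generally shows up (the exact strings carry a region prefix, so this matches on a substring):

# A sketch for surfacing cross-AZ data transfer spend: group costs by
# usage type and keep the "Regional-Bytes" entries. Exact usage type
# strings vary by region prefix, so this filters on a substring.
import boto3

ce = boto3.client("ce")

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2021-01-01", "End": "2021-02-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
)

for group in response["ResultsByTime"][0]["Groups"]:
    usage_type = group["Keys"][0]  # e.g. "USE1-DataTransfer-Regional-Bytes"
    if "DataTransfer-Regional-Bytes" in usage_type:
        cost = float(group["Metrics"]["UnblendedCost"]["Amount"])
        print(f"{usage_type}: ${cost:,.2f}")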


So, imagine this scenario: you're running Cassandra. I have a hilarious memory of an account manager years ago who told me, “I could look at your Amazon bill, and I could tell if you're running Cassandra or not because of that data transfer spend.” I laughed at the time, and now, having seen hundreds of Amazon bills, yeah, I could do the same thing. I could absolutely tell you if you're running Cassandra because it's replicating its writes across every AZ. It's incredibly expensive to run something like that. But imagine how much harder it is to figure out where your data transfer spend is going when you have a variety of workloads in a shared cluster.


Your Amazon and EC2 tags are probably going to be largely irrelevant, so then how are you going to track the spend of your applications on Kubernetes? And again, please go to lastweekinaws.com/QA. Tell us how you're solving this now, or even if you're just thinking about it, because I've yet to see any solution out there that truly solves this problem of defining, understanding, and tracking your unit economic spend down to that Kubernetes level. If I could just grab a tag on my Cassandra cluster, I could see all the costs associated with that tag, like data transfer and EBS and things like that, but how is that going to happen when it's on a Kubernetes cluster?


Jesse: I feel like so far, we've only seen this problem solved by tooling, whether that's a third-party tool or an in-house tool, specifically designed to look at your Kubernetes clusters and give you insights about usage: what is over-provisioned, what is under-provisioned, et cetera. There's no easy first-class option that I've seen so far that really gives us this information, either within Kubernetes or within the AWS-specific services like EKS and ECS.


Pete: Right, yeah. I could imagine that if you were an enterprise that is very mature in your cloud cost management—let's say you have a really high percentage of tags across your infrastructure, and you're able to really granularly see spend by product, by team, by service; I was doing a lot of this at many of my previous places, really analyzing that spend—my biggest fear with Kubernetes is losing that visibility. I’ve got a Kubernetes cluster of, I don't know, 10 EC2 instances; how do I figure out which applications on that cluster make up what percentage of that spend? Would I break it out by CPU used? What about memory used? Just, how would you allocate that spend across that cluster? Or would you then say, “Oh, well, maybe I'll break out a cluster based on each… product? Engineering team?” Where does that go?


Jesse: Not to mention, this is assuming that all of the infrastructure has the appropriate user-defined cost allocation tags associated with it, whether that is EC2 instances or EBS volumes. Because in a lot of cases, that's the other problem; maybe the cluster is all tagged with the tag cluster-1 because the infrastructure team deployed it and they know to use user-defined cost allocation tags. So, it's all tagged as cluster-1 or microservice-1, but within that, who knows? Maybe the developers are aware enough to deploy tags through their deployment pipeline through automation. Maybe they're not. Maybe they need to attach EBS volumes and those EBS volumes aren't getting user-defined cost allocation tags associated with them. There's a wide variety of gray area there.


Pete: Yeah, exactly. These are questions that any team should be asking themselves before they undertake any new kind of platform deployment: understand how you're going to get answers to these questions, even if you don't care about them right now. You're an operator; you want to spin up Kubernetes because you want to get that resume beefed up for that next role, right? Like, “I’m a Kubernetes admin.” That's an instant 20 percent pay bump at the next place.


Jesse: Yeah.


Pete: [laugh]. But the reality is that at some point, someone at a higher pay grade than you is going to ask, “Why is our bill so high?” And you're going to look at this Kubernetes cluster and go, “I don't know how to answer this.” So, any advance planning you can do to understand how to allocate that spend across however many applications, the better off you'll be in the future. One of the best things that I've learned recently is the concept of taking a cluster and breaking it into units, almost, where you'll have maybe a certain instance type you're going to use for that entire cluster.


If you start breaking that out into cost per vCPU and cost per gigabyte of memory, you almost break the cluster into usable chunks. Then this cluster of 10 instances maybe breaks into, like, 20 applications of a very specifically defined size. Now you have a unit that you can apply even to applications of different sizes. It's almost like a normalized unit, like you might use with an instance reservation. Breaking things down to that level looks something like the sketch below.
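
Here's that normalized-unit math, with hypothetical node shapes and rates, and an arbitrary 50/50 split of the node price between CPU and memory; how you split it is a policy decision for your organization, not a technical one.

# A sketch of cluster unit economics: turn a homogeneous cluster's
# hourly cost into a cost per vCPU and per GiB, then charge each app
# for what it requests. All figures are hypothetical.

nodes = 10
vcpus_per_node, gib_per_node = 8, 16   # e.g. a c5.2xlarge-ish shape
hourly_per_node = 0.34                 # illustrative on-demand rate

cluster_hourly = nodes * hourly_per_node
# Split the node price between CPU and memory however your org agrees;
# the 50/50 split here is an arbitrary assumption.
cost_per_vcpu = (cluster_hourly * 0.5) / (nodes * vcpus_per_node)
cost_per_gib = (cluster_hourly * 0.5) / (nodes * gib_per_node)

apps = {"checkout": (4, 8), "search": (12, 16), "batch": (2, 32)}  # (vCPU, GiB)
for name, (vcpu, gib) in apps.items():
    hourly = vcpu * cost_per_vcpu + gib * cost_per_gib
    print(f"{name}: ${hourly:.3f}/hour (${hourly * 730:,.0f}/month)")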


To be honest, I'm not sure how you would handle that if you had a Kubernetes cluster of different instance types, unless you did, again, some additional tagging in there. And you can see, the complexity gets pretty high here, especially if you're not planning for it before you deploy. So, that's probably the best advice I can think of: plan in advance for how you're going to figure out who the largest consumer is. Are you going to do it by monitoring, maybe at the YAML level? Are you going to try to front-load additional tags? Or maybe use a third party, or your monitoring system? I'm not sure.


Jesse: Absolutely.


Pete: These are all helpful things, right, Jesse? Because eventually someone in finance is going to come over to you and say, “What is this?”


Jesse: Absolutely, I think it's worth noting that Kubernetes not only has the overhead cost of managing the infrastructure—if you choose to manage an open-source version on EC2 instances on-demand versus something like EKS—but there's also overhead in terms of managing the optimization of workloads on it, and managing the attribution of costs for workloads on it. So, be mindful that it's not just expensive in terms of the amount of money you're actually spending on AWS, but the amount of time you're spending on this infrastructure as well, from an engineering perspective.


Pete: Well, I can't think of anything that is more insightful to say than that, which usually means we've reached the end of this podcast. So, if you have enjoyed this podcast, please go to lastweekinaws.com/review, and give it a five-star review on your podcast platform of choice whereas if you hated this podcast, please go to lastweekinaws.com/review and give it a five-star rating on your podcast platform of choice and tell us why you love Kubernetes so much. Don't forget, we would love to hear your questions about anything related to cloud cost management, how you do cost allocation on something like Kubernetes. Go to lastweekinaws.com/QA, shoot us a question and we will pull those together and we will answer them on the air. Thank you very much.


Announcer: This has been a HumblePod production. Stay humble.

Fri, 08 Jan 2021 03:00:00 -0800
Terrible Ideas for Avoiding AWS Data Transfer Costs

Want to give your ears a break and read this as an article? You’re looking for this link.

Sponsors



Never miss an episode



Help the show



What's Corey up to?

Wed, 06 Jan 2021 03:00:00 -0800
Amazon Lookout for 2020
AWS Morning Brief for the week of January 4, 2021 with Corey Quinn.
Mon, 04 Jan 2021 03:00:00 -0800
AWS Wishlist and Chrismahanukwanzakah Part 2

Links


Transcript
Corey: When you think about feature flags (and you should), you should also be thinking of LaunchDarkly. LaunchDarkly is a feature management platform that lets all your teams safely deliver and control software through feature flags. By separating code deployments from feature releases at massive scale (and small scale too), LaunchDarkly enables you to innovate faster, increase developer happiness (which is more important than you think), and drive transformation throughout your organization.


LaunchDarkly enables teams to modernize faster. Awesome companies have used them, large, small, and everything in between. Take a look at launchdarkly.com to learn more and tell them that I sent you. My thanks again for their sponsorship of this episode.


Pete: Hello and welcome to the AWS Morning Brief. I am Pete Cheslock.


Jesse: I'm Jesse DeRose.


Pete: We are welcomed yet again with Amy Negrette.


Amy: Hello.


Pete: We are here. We made it. It is actually 2021.


Jesse: I can tell you flying cars: definitely a thing. World peace: we're close, we're so close.


Pete: We're so close. Well, guess what? We made it; we survived 2020. And with us, we brought part two of the #awswishlist. This is where—especially leading up to re:Invent and getting through re:Invent—we went through the Twitter hashtag #awswishlist so that we could pick out some of our favorite things: items that we think are important to us, or just interesting in their own right. We'll include the link to these tweets in the [00:01:57 show notes].


So definitely go check that out; you can check out the conversation, or maybe follow some of that to see when things actually come around. But yeah, we'll just walk through some of the things we found that were pretty interesting and chat about why we hope Amazon includes them in a future release. So, one thing that I saw which I thought was pretty interesting, because I run into this problem also, is a way of downloading data from various third-party locations directly into S3, Dynamo, or some other data store. Essentially, it'd be awesome to completely get rid of having services, Fargate tasks, or Lambdas set up just for downloading data from places. And this is, again, not an enterprise-y type feature, but just, like, a personal thing: how cool would it be to take this ISO from somewhere and just give S3 a URL and say, “Put that thing in this thing,” and call it a day. So, again, a personal complaint of mine; plus, also, someone else tweeted it, so there's two people out there that want this—at least—so therefore, Amazon, you've got to build it for me.
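
Until AWS builds “put this URL in that bucket,” the usual workaround is a small piece of glue that streams the download straight into S3 without touching local disk. A sketch; the URL, bucket, and key are placeholders:

# Stream a third-party download straight into S3 without buffering
# the whole file locally. URL, bucket, and key are placeholders.
import boto3
import requests

def url_to_s3(url: str, bucket: str, key: str) -> None:
    s3 = boto3.client("s3")
    with requests.get(url, stream=True) as response:
        response.raise_for_status()
        response.raw.decode_content = True
        # upload_fileobj streams the body up in multipart chunks.
        s3.upload_fileobj(response.raw, bucket, key)

url_to_s3(
    "https://example.com/slackware-2001.iso",  # hypothetical source
    "my-iso-hoard",
    "isos/slackware-2001.iso",
)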


Amy: Those are the rules.


Pete: Those are the rules. Right. Right, Amy, those are the rules.


Jesse: And I feel like, let's be honest, that ISO that you want to download anyway is probably living in S3 somewhere else anyhow. So, it's just moving bucket to bucket.


Pete: Someone has that, you know, Slackware ISO that I've been looking for, from, you know, 2001. It's in someone else's bucket; just let me have it myself. Exactly. Amy, what did you find in your discovery of the #awswishlist hashtag?


Amy: This is a thing that I think really should exist on any of these on-demand, pay-as-you-go services, because AWS really targets those [00:03:48 unintelligible] markets for a lot of their serverless deployments. And this actually came from one of my friends who had this problem on Twitter: you need to be able to set a maximum on on-demand spend, let's say, in his case, on Dynamo, so you don't hypothetically build in a loop and spend a whole bunch of money.


Pete: Yeah.


Amy: And really, it should apply to anything that works that way. If the service isn't letting me say, “I only want to run this much,” because it's on-demand, then you should be able to control that spend somehow.
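
The closest built-in today is an AWS Budgets alert, which warns rather than caps; a sketch of one scoped to DynamoDB, with the account ID, dollar amount, and email address as placeholders:

# AWS won't hard-cap on-demand spend; the nearest built-in is an AWS
# Budgets notification. Account ID, amount, and email are placeholders.
import boto3

budgets = boto3.client("budgets")

budgets.create_budget(
    AccountId="123456789012",
    Budget={
        "BudgetName": "dynamodb-monthly-guardrail",
        "BudgetLimit": {"Amount": "50", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
        "CostFilters": {"Service": ["Amazon DynamoDB"]},
    },
    NotificationsWithSubscribers=[{
        "Notification": {
            "NotificationType": "ACTUAL",
            "ComparisonOperator": "GREATER_THAN",
            "Threshold": 80.0,            # percent of the budget
            "ThresholdType": "PERCENTAGE",
        },
        "Subscribers": [
            {"SubscriptionType": "EMAIL", "Address": "you@example.com"},
        ],
    }],
)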


Pete: And with the—what is it—millisecond billing on Lambda, you can get really granular bills for your poorly architected Lambda functions.


Jesse: I feel like computers are the best because they'll do exactly what you want them to do, except for when they do what you tell them to do and not what you actually want them to do, and that drives me absolutely insane. So, I'm with you. I think that this is a great opportunity.


Amy: That problem will be solved when the robots take over.


Pete: [laugh]. One of my favorite discoveries from doing our Duckbill cost optimizations, where we dive into people's spend and help them architect things anew, was finding a Lambda function that was taking longer and longer to execute—meaning, costing more money—because it was putting more and more data into a poorly configured Dynamo table, which in turn caused it to take longer and longer. So, not only did you have a poorly configured Dynamo table taking in this data, it was taking longer to do it; you were getting hit on both sides. It happens.


Jesse: That hurts my soul.


Pete: So, what’d you find, Jesse? What was some of the good wishlist items that you're hoping for in 2021?


Jesse: So, I come from a background heavy on infrastructure as code: I've worked a lot with Terraform, and I know enough about Chef to be dangerous to your production environment. One thing that I saw a couple people tweet about that I would love to see is mock AWS API endpoints: effectively, unit tests for a lot of infrastructure as code. Because when you're building infrastructure as code, the only way that you can really test it is by running it: by actually seeing, “Can I actually create the resources that I think I'm creating with this infrastructure-as-code content?” So, I would love to see maybe a feature flag for AWS services through the API where you can say, “Hey, don't actually create this RDS database or this EC2 instance, but return the results as if I did create it. Maybe leave the instance ID blank or something like that.” And then you, in writing your unit tests, can confirm all the details that you would expect to see in that response.
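
That feature flag doesn't exist natively, but botocore ships a client-side approximation: the Stubber, which returns canned responses without ever calling AWS. A minimal sketch:

# botocore's Stubber intercepts a client's calls and returns canned
# responses, so a unit test never touches AWS.
import boto3
from botocore.stub import Stubber

ec2 = boto3.client("ec2", region_name="us-east-1")
stubber = Stubber(ec2)

stubber.add_response(
    "describe_instances",
    {"Reservations": []},  # the canned response the "API" will return
    {},                    # the request parameters we expect to see
)

with stubber:
    result = ec2.describe_instances()
    assert result["Reservations"] == []  # verified without an AWS call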


Pete: I feel like there was a—Atlassian, maybe, had a project that was something like this, some sort of a way of unit testing these things. Again, it was something on GitHub, so even if it was associated with a large publicly traded enterprise, I'm sure it's fallen into disrepair at this stage.


Jesse: [laugh]. I will say, looking into this one, I found an open-source tool called LocalStack that allows you to spin up a service on your local machine that acts as the AWS API endpoint, so it creates this mock endpoint for you locally. But effectively, I'd love to see that as a first-class citizen in AWS.


Pete: I think the way they're solving this is just having people buy Outposts instead: “Oh, just run it against your Outpost. It's right there. It's all local.” So, one thing that I saw, which I had a really interesting reaction to because it's related to a previous company I worked at: this person said, “I’ve got a lot of items in my #awswishlist, but a serverless version of Elasticsearch on AWS—or a serverless service that is Elasticsearch compatible—would certainly be at the top.” I don't know if this is sponsored by ChaosSearch because—


Jesse: [laugh].


Pete: —sponsorships are handled separately from the recordings. And this is definitely not me, specifically, sponsoring them, other than the fact that I used to work at ChaosSearch, and the technology that they created, which was so cool, was this kind of concept. So, it's just really interesting to see other people getting to that point. But I think what's interesting about it is that Amazon has a serverless Aurora, which we use within Duckbill. It's a really interesting technology.


There's a V2 version that came out as part of re:Invent. So, this is a technology that is growing inside of there: these kind of on-demand databases for when I don't need a thing running all the time. I think Athena is that concept, right, for ad-hoc queries: I don't need to run a SQL server; I just want to run some queries every once in a while. It would be really fascinating to see this type of technology become more prevalent.


I may not want to run Elasticsearch all the time. Let's be honest, no one wants to run Elasticsearch any of the time, [laugh], but with a serverless version, I’ll probably end up paying more, in general. Like Amy said, you'll have the loops causing additional spend; I'm sure that will increase your serverless Elasticsearch spend as well. So, what was another item that you found, Amy, in your wishlist bag?


Amy: There is another type of database that users would like to see go serverless, and that's graph databases laid on top of Dynamo. Which does make sense, because a lot of times you don't want to have to spin up your own things to experiment with graph. And graph is the type of thing that, once it's fully implemented, you can drop on top of whatever your data source is. AWS does have a graph service, Neptune, but just like all native services offered by AWS, it is a little much if you're just trying to figure out if it's the thing that you want. No one wants to look at that wall of Neptune documentation; it's like, “Maybe I don't need it right now.” And then it becomes one of those backlog things you learn later.


Pete: Or someone spins it up and says, “Oh, I’ll play around with this later.” And then—


Amy: And they leave it on, forever.


Pete: Yeah. Until they call us, and we find it for you. [laugh]. Jesse, what's in your bag? What's in your wallet, Jesse?


Jesse: [laugh]. If we want to continue my theme of scaring AWS engineers, I would love to see a better user experience for the Personal Health Dashboard. So, between the recent Kinesis outage that prevented the public status page from being updated, and some of my own experience basically being the canary in the coal mine when the status page said, “Oh, yeah, everything's fine. All systems are go. It's good. You're fine.”


But I've been the first person to call out, “Hey, I think there's actually something going on.” I struggled with that public status page. And I know I should use the Personal Health Dashboard more, but I just feel like it doesn't provide as much value as I want. I feel like it tells me things that I already know from other official sources, or doesn't tell me anything at all. So, I would love to see something better with the Personal Health Dashboard: maybe more engagement, maybe more or different alerts in the dashboard. Something that makes it more beneficial for me as an AWS customer.


Corey: This episode is sponsored by our friends at New Relic. If you’re like most environments, you probably have an incredibly complicated architecture, which means that monitoring it is going to take a dozen different tools. And then we get into the advanced stuff. We all have been there and know that pain, or will learn it shortly, and New Relic wants to change that. They’ve designed everything you need in one platform with pricing that’s simple and straightforward, and that means no more counting hosts. You also can get one user and a hundred gigabytes a month, totally free. To learn more, visit newrelic.com. Observability made simple.


Pete: Not to speak for you, but I think what you want is the Amazon simple status page, which is a simpler status page than the current status page that goes down a lot. I want to see a thumbs up or a thumbs down emoji. Is Amazon working? Thumbs up. Is one part of Amazon down? Thumbs down, right? That's all I want to know: if something's wonky at Amazon, is the thumb up or the thumb down? Just give me that.


Jesse: Yeah. Think about it: the website, is it down for everyone else, or just me? I just need to know, just a real quick, hot take. “Is it just me?” “Yeah? Everything else? Okay, cool.” I'll keep troubleshooting my own thing internally. Is there something bigger going on? Okay, cool, then I'll just let AWS sort out whatever they need to.


Pete: So, one of the last things that I had on my list, which I really like because it can hopefully simplify a lot of the more complex network scenarios that we see, is an all-Amazon-services PrivateLink. So, you've got gateway endpoints, right? Those are Dynamo and S3 only. Those are free. It's great. They're free services. How often do you get that?


But then you have these interface endpoints for other Amazon services, maybe like Kinesis or SQS. Can we simplify this? I have my stuff in a VPC that can't go to the internet in any way, and it wants to talk to something inside the Amazon ecosystem. Can I just have one endpoint for all services? Sure, I'm sure you're going to charge me, like, $800 per gigabyte to use it, but it would at least allow me to have one of these things versus one per service, per VPC, per account. So, I think that would be a great one. And that would definitely solve a lot of folks’ pain out there.


Jesse: I can speak from a little bit of experience here, working with a client who had a standardized template where, for every new VPC they spun up, they also spun up some number of PrivateLink endpoints for what they knew was the standard traffic going in and out of each VPC. Several months down the road, they now have multiple VPCs with different levels of data traffic; some of them are actively using all of these PrivateLinks, and some of them are using none of them; they're just hanging out there. So, in a lot of cases, these PrivateLinks are just accruing hourly spend and not doing anything, whereas in some cases they're actively being used, and maybe they need more. I would so love to see all of that consolidated into a single PrivateLink endpoint that all data traffic goes across. Sure, I get that you're probably going to charge me for this level of convenience, but I would rather that than the networking headache of figuring out how to transfer data between VPCs without going out through the public internet.


Pete: Yeah, absolutely. Anything left in your grab bag of #awswishlist items, Amy?


Amy: There is one thing that I want, and it's mildly frustrating because this is the source of most of my monitoring problems: there's no easy way to tell on AWS when a Lambda has failed. And not when it threw a bunch of errors and then exited, because that’ll show up in CloudWatch real easy, but when it times out. The way you know it timed out is that the runtime will be one second short of whatever maximum you set it to, and you have to find out what that time is and look for anything that lands within that range. It's infuriating and confusing.


Jesse: Absolutely.


Amy: Someone literally put, “Just give me every time that a Lambda has failed.” I'm like, “Yeah, that would be super nice.”
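
For what it's worth, Lambda does write “Task timed out” into its CloudWatch log group, even though nothing surfaces it nicely. A sketch that hunts for those messages; the function name and time window are placeholders:

# Find Lambda timeouts the hard way: Lambda logs "Task timed out"
# to /aws/lambda/<function>. Function name and window are placeholders.
import time
import boto3

logs = boto3.client("logs")
now_ms = int(time.time() * 1000)

events = logs.filter_log_events(
    logGroupName="/aws/lambda/my-function",   # hypothetical function
    startTime=now_ms - 24 * 60 * 60 * 1000,   # last 24 hours
    filterPattern='"Task timed out"',
)

for event in events["events"]:
    print(event["timestamp"], event["message"].strip())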


Pete: So, maybe we ventured into the personal grievances section of our #awswishlist because, Amy, that sounds like a personal grievance, and it's a great one—


Amy: That wasn’t me.


Pete: [laugh].


Jesse: [laugh].


Amy: Mine is completely different. [laugh].


Pete: [laugh]. Well, Jesse, do you have any personal grievances—things in your usage of Amazon where, if this one thing was built in 2021, it would make your year?


Jesse: You know what? I'm going to go really easy because I think I've been really, really hard on AWS through most of this list. I'm going to make this one really easy. All that I want is the most important feature that every application should have, by this time: dark mode in the console.


Pete: [laugh].


Jesse: That's all I want. I just want to be able to log in and see that beautiful, beautiful, dark-shaded background against all of that bright text. Now, I get that the color scheme might need to change a little bit, so that might be something that we have to involve product for, but it's so close. We’re so close.


Pete: We'll get there in 2021. Maybe they’ll announce it in preview at re:Invent 2021. So, mine was a little bit more specific, and maybe on the nose, considering we spend so much of our time with the Cost and Usage Report and so much of our time in Cost Explorer. But one thing that I would really love to see—I mean, there's a disparity between the CUR and Cost Explorer, and it really comes down to resource IDs.


When you turn on the CUR, you have the option of turning on resource IDs, and let's be honest, you want to turn that on because when you do, you can basically start getting around a lack of tagging. If, for example, you wanted to create a report of your bucket usage and break out that bucket's spend—like, what percentage of this bucket's spend was requests and what was storage—the only way that I'm really aware of to find that easily, without per-bucket tagging and that tag enabled for billing, is inside the Cost and Usage Report [00:17:24 unintelligible] with the resource ID. Now, I don't want to say this is simple, because the billing team handles trillions upon trillions of records that they have to provide us insight into with Cost Explorer, but maybe they could add resource ID functionality there.
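
For reference, the workaround Pete describes boils down to a query like this once the CUR is queryable in Athena. The table, database, bucket name, and output location are placeholders; the column names follow the CUR's Athena-flattened convention:

# Split one S3 bucket's spend into requests vs. storage from the CUR.
# Database, table, bucket, and output location are placeholders.
import boto3

athena = boto3.client("athena")

query = """
SELECT line_item_usage_type,
       SUM(line_item_unblended_cost) AS cost
FROM cur.my_cur_table
WHERE line_item_resource_id = 'my-bucket-name'
  AND line_item_product_code = 'AmazonS3'
GROUP BY line_item_usage_type
ORDER BY cost DESC
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "cur"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)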


Again, would I pay for this? I actually might, because getting resource-ID-level detail for S3 buckets in Cost Explorer would allow me to not have to set up the CUR, find a place to put it, load it into Athena, turn on the Glue crawler, and fire up Tableau and point it at it, just to get this report. So, that one feature would be huge for me. It's almost parity, right, between the CUR and Cost Explorer. Now, I said I would pay for that, but I’d just make Corey pay for everything, right? [laugh].


Jesse: [laugh]. Well, until Corey publicly drags us on social media for making him pay for things, and then, you know, we have to have a conversation.


Pete: Yes, that's true. [laugh]. Awesome. Well, it's an amazing start to 2021 because we made it; we're here. We're in the future. It feels good. Hopefully, the future will continue progressing this way. But as always, it's a lot of fun to chat about some of the things that would be cool to see in 2021 or future years, because they'll announce some of these things in preview, and maybe we'll see them in 2022, who knows? But Amy, thank you again for joining us and pulling your list together. Jesse, it's always a pleasure.


Thank you all for listening. If you enjoyed this podcast, please go to lastweekinaws.com/review, give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review, give it a five-star rating on your podcast platform of choice and tell us what is your hope for the future with AWS. Thank you.


Announcer: This has been a HumblePod production. Stay humble.

Fri, 01 Jan 2021 03:00:00 -0800
Counting Twitter Followers over Time, the Corey Quinn Way

Want to give your ears a break and read this as an article? You’re looking for this link.

Sponsors

Never miss an episode



Help the show



What's Corey up to?

Wed, 30 Dec 2020 03:00:00 -0800
Amazon Chat Slapfight
AWS Morning Brief for the week of December 28, 2020 with Corey Quinn.
Mon, 28 Dec 2020 03:00:00 -0800
AWS Wishlist and Chrismahanukwanzakah Part 1

Links

Transcript
Corey: This episode is sponsored in part by our friends at Linode. You might be familiar with Linode; they’ve been around for almost 20 years. They offer Cloud in a way that makes sense rather than a way that is actively ridiculous by trying to throw everything at a wall and see what sticks. Their pricing winds up being a lot more transparent—not to mention lower—their performance kicks the crap out of most other things in this space, and—my personal favorite—whenever you call them for support, you’ll get a human who’s empowered to fix whatever it is that’s giving you trouble. Visit linode.com/screaminginthecloud to learn more, and get $100 in credit to kick the tires. That’s linode.com/screaminginthecloud.


Pete: Hello and welcome to AWS Morning Brief. I am Pete Cheslock. I'm joined yet again by Jesse DeRose. We are also excited to re-invite recurring guest, for appearance number two, Amy Negrette. Say hello, Amy.


Amy: Hello.


Pete: So, we are here. This is Christmas. Or should I say Christmahanukwanza.


Jesse: So, close. That works.


Pete: So, close. But it's the Christmahanukwanza episode—Hanu—hanukwanza—


Jesse: Christmashanukwanzika.


Pete: And if you thought Hanukkah was spelled a bunch of different ways, Christmahanukwanza is spelled a lot of different ways. And we are here to talk about the #awswishlist, which is honestly one of my favorite hashtags to follow on Twitter. It is pretty popular; it's heavily used.


Jesse: It was actually so heavily used that they made a specific @awswishlist account, basically to follow a lot of these hashtags and to re-highlight them, especially when some of the wishes are actually fulfilled.


Pete: Yeah, I think it's a great thing, and if I were an Amazon product manager, I would love this too. Talk about making my job a lot easier, I guess.


Jesse: One thing that I do want to call out: I was looking through a number of the tweets going around for the hashtag #awswishlist, and I noticed some responses from AWS folks. One, I'd love to say thank you, AWS, for actually taking this seriously and responding to folks in conversation on Twitter about these wishlist items. But there was one response I found where the person directed the original poster to an AWS support page, which was basically AWS’s ‘Contact Us’ page. And the Contact Us page basically said, “Hey, if you have some questions, here's what you should do. ‘I have some questions that could help improve an AWS product or service; how can I send feedback to AWS?’” And all the answers were, “Click the feedback button on the page that you're on, either in the AWS console or the AWS documentation, or contact AWS support directly.” So, close—


Pete: Did you just tell me to go F myself there, Jesse? [laugh].


Jesse: [laugh]. I didn't maybe say it in so many words, but I think I did.


Amy: I absolutely love it when a support page says, “Maybe you should just do it yourself.” And I'm like, “Well if I did, I probably wouldn't have been here in the first place.”


Pete: Exactly. So, what we decided to do, what we thought would be kind of fun, is to trawl through the Twitter #awswishlist hashtag and take a look at what people were saying, especially because it's a lot busier around re:Invent time. And so each of us independently put together a list of things that—at least I can speak for myself—I thought were interesting, or things that would be cool to have. And yeah, we're just going to talk about them and see from there. So, we'll include a link to each of these tweets in the [00:04:18 show notes] so you can check them out, and also so you can see the conversation on them.


What was also cool, I just want to call out, is that some of the wishes we saw on there, at least that I saw, had been resolved by re:Invent time. One was AWS CloudShell, which was announced recently at re:Invent; someone was saying, “What I want is this AWS CloudShell thing because other vendors have this: Azure has this, Google has this.” So, here's a scenario where Amazon was catching up, which I thought was pretty cool to see. So, I'm going to kick it off because, whatever, I'm here, and I've got my list in front of me.


So, this is actually related to the CloudShell one, which I thought was interesting. There was some conversation online about CloudShell maybe potentially allowing people to remove the need for a bastion host, which, how cool is it that you wouldn't have to run those anymore?


Jesse: Oh, yeah.


Pete: And so there was a question around, “Well, does my identity get a home directory?” It sounds like the answer was “Yes.” But the question mark there had to do with using AWS SSO, because it comes down to the IAM principal: it's what comes back from sts get-caller-identity. If you are using one of the different federation technologies, your actual identity could be different for each one. And so that's a wishlist item that I could definitely be on board with, because if you're dealing with IAM roles or federation and your home directory is never the same, that can be kind of annoying.
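
You can see the identity in question directly from code, and the ARN that STS returns differs by federation path, which is exactly why a per-identity home directory gets messy. The ARNs in the comments are illustrative:

# The "who am I" that CloudShell would have to key a home directory on.
import boto3

identity = boto3.client("sts").get_caller_identity()
print(identity["Arn"])
# IAM user:          arn:aws:iam::123456789012:user/pete
# SSO assumed role:  arn:aws:sts::123456789012:assumed-role/AWSReservedSSO_Admin_abc123/pete
# Same human, different principal, depending on the federation path.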


Jesse: I cannot tell you how many times I have downloaded a file or put a file somewhere on a bastion host, gone away to a different project, come back to it, or SSH’ed into the same bastion host and wondered why it wasn't there anymore, only to realize that I was on a different bastion host in a different environment, or that the data had been purged every so often for security or cleaning purposes. I would absolutely love clean roles and just really, really well defined boundaries on this. Coming from somebody who uses different AWS accounts on a regular basis for the different clients that we work with, I would just love to see this really kind of clean structure of AWS, IAM usage, and user management and security.


Pete: And, Jesse, we saw similar issues, I believe, when we were playing around with QuickSight, and Federation, and IAM so—


Jesse: Oh, yes.


Pete: Hopefully that gets a little bit fixed up. But anyway, I thought that was a pretty interesting one. Amy, what did you find in your discovery of the Amazon wishlist hashtag?


Amy: I did find one for X-Ray support in API Gateway HTTP APIs—again, one of the worst, longest names of any service—and EventBridge. Which, surprisingly: one, this hasn't happened yet; but two, [00:07:12 unintelligible] for me is kind of a double-edged sword, where it's one of those services that everyone needs, but the UI for it—like all Amazon monitoring—leaves a lot to be desired if you are a human being with eyes. But it's also a weirdly priced service that seems to be more expensive as a native service than the competitors that are third-party services.


Jesse: One of the things that I'm always fascinated by with AWS is the model of buying—or I should say acquiring new companies to compete or building the same thing internally to compete. And this is going to be an interesting one because I feel like if there are other competitors out there that are doing the same thing for a cheaper price, it's going to be difficult for AWS to continue to sell this X-Ray service at its current structure. But I'll also say that I think that it's so built into a lot of existing AWS native solutions that I feel like that's where they have the advantage. But again, every step of the way, AWS just has this opportunity to make things better, especially when you want to talk about the pillars of excellence. I think that this is a great one to focus on to say, these services do not work as well, the user experience is not as clean as it could be.


Pete: Well, there's always room for improvement. So, what did you find, Jesse? What was one of your favorite wishlist items you found?


Jesse: I'm gonna go big here. I think my number one that I had—and to clarify, these are not in any particular order, but I found this one first, so I was thinking about it first as I started thinking about wishlist items. My biggest one is I would love to see simplified reservation options across AWS services. So, for example, right now, AWS provides savings plans, which support EC2, Fargate, Lambda, and Fargate for EKS. And I'll give them that some of those services didn't have other reservation options in the past before this.


But then you've also got reserved instances for EC2, RDS, ElastiCache, Elasticsearch, and Redshift. Then for DynamoDB, reservations are 'reserved capacity.' Elemental MediaConvert has 'reserved queues.' And there are all sorts of different places where all of these reservations can be managed. I would love to see all of these manageable in a single place, ideally the billing console, and then maybe also accessible across the different services where these resources and reservations live.


But just one easy way that all of these things are consolidated in the same umbrella term so that ultimately when we're talking about reservations, we can talk about it broadly across all AWS resources, or all AWS services, not just the ones that are covered by a certain type of reservation.
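As a taste of the fragmentation Jesse is describing: even programmatically, reserved instances and savings plans live behind separate Cost Explorer calls. A minimal sketch, assuming Python and boto3, with placeholder dates:

```python
import boto3

ce = boto3.client("ce")  # Cost Explorer
period = {"Start": "2020-11-01", "End": "2020-12-01"}

# Reserved instance utilization (EC2, RDS, ElastiCache, Redshift, ...)
ri = ce.get_reservation_utilization(TimePeriod=period)
print("RI utilization:", ri["Total"]["UtilizationPercentage"])

# Savings plans utilization lives behind an entirely separate call.
sp = ce.get_savings_plans_utilization(TimePeriod=period)
print("Savings plans utilization:",
      sp["Total"]["Utilization"]["UtilizationPercentage"])

# DynamoDB reserved capacity and MediaConvert reserved queues are not
# surfaced by either call; each has its own console and API surface.
```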


Corey: This episode is sponsored in part by ChaosSearch. Now their name isn’t in all caps, so they’re definitely worth talking to. What is ChaosSearch? A scalable log analysis service that lets you add new workloads in minutes, not days or weeks. Click. Boom. Done. ChaosSearch is for you if you’re trying to get a handle on processing multiple terabytes, or more, of log and event data per day, at a disruptive price. One more thing, for those of you that have been down this path of disappointment before, ChaosSearch is a fully managed solution that isn’t playing marketing games when they say “fully managed.” The data lives within your S3 buckets, and that’s really all you have to care about. No managing of servers, but also no data movement. Check them out at chaossearch.io and tell them Corey sent you. Watch for the wince when you say my name. That’s chaossearch.io.


Pete: Yeah, I mean, reservations exist. And to your point, Elemental MediaConvert: probably most people have never heard of that service, and now you're like, "Wait, there's a reservation option for that?" So, what I would love to see as part of that, my wishlist item, would be: bring it into AWS Organizations at that level, and give me that really high-level view over everything. That would be [kiss], you know, a chef's kiss, right?


Jesse: [laugh]. I love that idea.


Pete: So, you came in with this big, huge one—you're just scaring all these Amazon engineers with, oh my God, how terrible that is. I'm going to come in with one that I saw and feel really strongly about. I think it's still probably a massive engineering undertaking, which is: stripping the trailing whitespace from any text that you copy and paste into the different console search [laugh] boxes and stuff. Amazon has done this miraculous job of putting a little 'copy to clipboard' button next to some things, but it's not on all the things, so sometimes you copy, and then it grabs two tabs' worth of information and maybe a special character. So, if you can't fix that part of it, maybe just strip it when I paste it. I don't know, just ignore it in some places. So, again, that's probably a bigger engineering effort than yours, Jesse. [laugh].


Jesse: Yeah, I cannot tell you how excited I was to see the copy to clipboard button in a bunch of different places on the console, and how much headache that has saved me, so I would love to see more of that, making it easier to copy and paste ARNs and Instance IDs and other things across the console.
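The wishlist item itself is about as small as fixes come. A toy sketch in Python of the normalization Pete is asking the console to do on paste:

```python
def clean_paste(raw: str) -> str:
    """Strip the leading/trailing whitespace, tabs, and newlines that ride
    along when copying IDs out of a web console."""
    return raw.strip().replace("\t", "").replace("\n", "")

assert clean_paste("  i-0abc123def456\t\n") == "i-0abc123def456"
```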


Pete: Amy, I am sure you've been burned just like the rest of us with that one but other than stripping the whitespace from the copying and pasting, what do you think would be a great Amazon wishlist item?


Amy: Speaking of copying and pasting, I saw one—this is less about actually having to do it, and more about knowing why AWS won't let you do it. When you try to delete anything on AWS, they give you a text box with different words in it. Not CAPTCHA-level different words, but it's always some format of 'delete me,' 'remove me,' 'delete this resource'; sometimes it's the actual resource ID. And one of the requests was to standardize them, because if you're maintaining a lot of resources and trying to keep your resource count down, and you don't have an automated way of doing it, you are slowly losing your willpower delete after delete, trying to remember what all of these words are. And then you end up glossing over it anyway, instead of what it was obviously trying to do, which was give you a little speed bump so you don't delete stuff by accident.


There has to be a better way of doing that. I don't know what it is yet, but it is infuriating to get one resource deleted, then go to its neighbor resource, or something it's attached to, and try to delete that, only to have the phrase be different enough to give you a headache.


Pete: It's like an infomercial: “There has to be a better way.”


Jesse: [laugh]. I will add to that: when I have to manually delete resources from the AWS console, it drives me crazy when it's, say, an EC2 instance that has multiple EBS volumes attached, and it'll say, "Hey, if you delete this instance, we're automatically going to delete all of these EBS volumes as well." And then it lists out the Volume IDs, but I'm not going to be able to remember what these things are based on Volume IDs; I need the volume name, or maybe some tags or some kind of descriptive information that will help me confirm, "Oh, yeah, I don't need those volumes. That's fine." Rather than just a bunch of Volume IDs that, at that point, YOLO. Let's delete them. Who cares? I mean, it was easier when the Volume IDs and the Instance IDs were—what—eight characters, before they got expanded.
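You can get the human-readable context Jesse wants today, just not in the console's delete dialog. A minimal sketch, assuming Python and boto3, that lists an instance's attached volumes with their Name tags before you pull the trigger (the instance ID is a hypothetical placeholder):

```python
import boto3

ec2 = boto3.client("ec2")
instance_id = "i-0123456789abcdef0"  # hypothetical instance

# Find every EBS volume attached to the instance, with its Name tag,
# so the "do I really want to delete this?" check is human-readable.
resp = ec2.describe_volumes(
    Filters=[{"Name": "attachment.instance-id", "Values": [instance_id]}]
)
for vol in resp["Volumes"]:
    tags = {t["Key"]: t["Value"] for t in vol.get("Tags", [])}
    print(vol["VolumeId"], tags.get("Name", "<no Name tag>"))
```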


Amy: 5000 characters.


Pete: Yeah. You know, I could remember those, like, “Oh yeah, Instance ID, like, I-A7D? I know that one. That's my friend. That's my NFS server. I need that server.” Right? I could always remember those.


Jesse: I'm going to go for another really big one here. I'm going to call out—okay, actually, you know what? That's a lie. It's not a big one; it's probably a sprawling one, but I wouldn't say it's super technically difficult. I would love a very clear, easy-to-understand document describing which AWS usage is taggable and which AWS usage is not taggable.


When I'm talking about taggable, I'm talking specifically about user-defined cost allocation tags. This is something that I will say from experience has driven me crazy because I have built a number of lists based on what is and is not taggable and created Cost Explorer reports off of this information, but I would just love one single definitive AWS document to rule them all—I mean, who wouldn’t—talking about how you can easily manage your cost attribution and your unit economics based on which AWS resources and which usage types are taggable and which ones aren’t.
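In the meantime, one rough way to see the taggable/untaggable split for yourself is to group spend by a cost allocation tag in Cost Explorer; usage that cannot carry the tag lands in an empty tag bucket. A hedged sketch, assuming Python and boto3, with a hypothetical 'team' tag key:

```python
import boto3

ce = boto3.client("ce")

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2020-11-01", "End": "2020-12-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "team"}],  # hypothetical tag key
)

# Spend that could not be attributed to the tag shows up under "team$"
# (an empty tag value), a rough proxy for untagged or untaggable usage.
for group in resp["ResultsByTime"][0]["Groups"]:
    print(group["Keys"][0], group["Metrics"]["UnblendedCost"]["Amount"])
```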


Pete: Yeah, I mean, that's a really great point because, as we know—because we deal with this every day—to capture resources that are not taggable, the only way to do that is to have segmented accounts, which could be a massive engineering effort; it could maybe overcomplicate your network architecture. And so, I dream for that world; that's the future I want.


Well, that was a pretty amazing part one of the #awswishlist items because there are Twelve Days of Christmas, there's going to be at least two days of #awswishlist items. So, stay tuned for part two next week, where we will invite Amy back again, to go through the rest of our wishlist items, and because it will be the new year—goodbye 2020 and welcome 2021—we'll maybe share some of our hope for the future. So, we'll hope to see you next time for that.


If you enjoyed this podcast, please go to lastweekinaws.com/review and give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review and give it a five-star rating on your podcast platform of choice and tell us what is your #awswishlist. Thank you.


Announcer: This has been a HumblePod production. Stay humble.

Fri, 25 Dec 2020 03:00:00 -0800
EBS Volumes

Want to give your ears a break and read this as an article? You’re looking for this link.

Sponsors

Never miss an episode



Help the show



What's Corey up to?

Wed, 23 Dec 2020 03:00:00 -0800
Some Cloud Shells Take Years to Form
AWS Morning Brief for the week of December 21, 2020 with Corey Quinn.
Mon, 21 Dec 2020 03:00:00 -0800
Ask a Cloud Economist: Cost Attribution in AWS

Links

Transcript
Corey: This episode is sponsored in part by our friends at Linode. You might be familiar with Linode; they’ve been around for almost 20 years. They offer Cloud in a way that makes sense rather than a way that is actively ridiculous by trying to throw everything at a wall and see what sticks. Their pricing winds up being a lot more transparent—not to mention lower—their performance kicks the crap out of most other things in this space, and—my personal favorite—whenever you call them for support, you’ll get a human who’s empowered to fix whatever it is that’s giving you trouble. Visit linode.com/screaminginthecloud to learn more, and get $100 in credit to kick the tires. That’s linode.com/screaminginthecloud.


Pete: Hello, and welcome to AWS Morning Brief. I am Pete Cheslock.


Jesse: And I'm Jesse DeRose.


Pete: We're back again, and we're here to answer an audience question. So, every once in a while people tweet at us—you can tweet me @petecheslock. Jesse, what is your Twitter handle?


Jesse: @Jessie_DeRose.


Pete: Yeah, mine is just petecheslock. I do feel bad for the other Pete Cheslock, who actually does live in Boston as well, because I'm taking all of his profile names.


Jesse: You should change yours to @therealpetecheslock, or he should change his to @therealpetecheslock, and then it'll just be an ongoing escalating battle.


Pete: That's very true. So, occasionally on the Twitters, we get questions asked of us around Amazon cost management, things like that. And we wanted to take this opportunity to answer one of the more interesting questions we've received. Because, granted, sometimes we get questions and they're pretty boring, so we don't answer them. We just focus on the fun ones, [laugh]—


Jesse: [laugh].


Pete: —selfishly, but we got this question that was really interesting. It had to do with someone who is essentially starting over within Amazon Web Services, meaning they were going to be redeploying their application into a series of new AWS accounts. And they asked us, “What are the most recent best practices—” I hate that term, but the important things you should do and consider when you're deploying into Amazon, into AWS. And we kind of sat back, we thought to ourselves, “Wow, how often does someone have that opportunity?” Right, Jesse?


Jesse: Yeah. Not in any of my experience has that happened for me. I'm very, very envious of these people.


Pete: Yeah, I had that opportunity one time, where we were essentially doing that, like, net-new, starting over. But this was years ago, when there wasn't a lot of insight into this, and we didn't have the features we have today, where AWS Organizations allows such an easy way to create accounts and get started with multiple accounts. So, anyway, we want to take this opportunity to talk about what we believe and what we see as the things that you should focus on, what you should optimize for, when getting started, when creating net-new in AWS.


Jesse: Yeah, there are a lot of different things that you can optimize for in AWS, and it really depends on what your business goals are: what do you ultimately want to accomplish when you are deploying your application into the cloud? But one of the big ones that we see, selfishly, here at Duckbill Group is cost optimization. And so we wanted to talk a little bit more about cost allocation and cost attribution—which are essentially the same thing; we may use the terms interchangeably in this conversation—and cover how you can think about cost attribution, why you should think about it, and some of the best ways to go about implementing it in AWS as you're building these new accounts, this new space.


Pete: Yeah, and that being said, I really like people to really think when they create these things. Again, what are you optimizing for? Some people might say, “Oh, well, we want to optimize for security.” And that's great. You absolutely should do that.


Jesse: Sure.


Pete: Security is a first principle, something to absolutely focus on. But what if I told you that the other, probably, most important thing in AWS is—and something if you're not doing it today, you're going to be asked to do it in the future—is accurate cost attribution. And what if you could do both highly secure accounts, and segment based on security, but also get this cost attribution? That is, I think, what we're going to dive into today.


Jesse: Yeah, I think there are a lot of big conversations around engineers and multiple other teams when you start talking about the DevOps movement, the DevSecOps movement, all these movements involving the software engineers who are actually writing the code and the engineers or operations folks who are, maybe, managing the infrastructure, maybe deploying the code; maybe the software engineers are deploying the code. It really depends on your team setup. But there's this idea that there are the engineering teams working with this code, and then there are all these other teams in the company with other top priorities. Start bridging that gap: have conversations with finance to better understand what they need to know from you about how you're spending money in AWS, and with security, who want to know, are we patched for the upcoming audit? Are we compliant with these terms? It's really important to think about how you optimize in AWS based on those ideas, those conversations with other teams. So, that's ultimately what I'm thinking about today: specifically, the conversation between finance and engineering about cost attribution.


Pete: But Jesse, aren't tags supposed to solve all of my problems when it comes to cost allocation?


Jesse: [laugh]. Oh, I wish. They are supposed to. There's that whole idea of ‘set it and forget it,’ there's a big movement of ‘tag it and forget it,’ and as much as I want to believe in that, it’s unfortunately just not true. Like, tagging is definitely a first step, but it goes so much further than tagging and I think that's one of the big things that a lot of folks miss or don't think about when they're talking about tagging and cost attribution.


Pete: If you loved it, you would have put a tag on it.


Jesse: [laugh].


Pete: But really, while tagging is an important thing to do—and we've seen some of our clients whose tagging percentages are upwards of 90 percent, which takes a herculean effort to reach that level of coverage—even then, getting that last 5 to 10 percent in many cases can be actually impossible, because there can be a slice of spend within Amazon that is just untaggable, or at least untaggable in a realistic way. And that's where multiple accounts can really help your business break out those costs in a really, really clean way.
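One way to measure the coverage Pete mentions is to page through what the Resource Groups Tagging API can see and flag resources missing your tag. A rough sketch, assuming Python and boto3 and a hypothetical 'team' tag key; note this only covers resources that API supports, so truly untaggable spend never shows up here at all:

```python
import boto3

tagging = boto3.client("resourcegroupstaggingapi")

total = untagged = 0
paginator = tagging.get_paginator("get_resources")
for page in paginator.paginate():
    for res in page["ResourceTagMappingList"]:
        total += 1
        tags = {t["Key"] for t in res.get("Tags", [])}
        if "team" not in tags:  # hypothetical required tag key
            untagged += 1
            print("untagged:", res["ResourceARN"])

print(f"coverage: {100 * (total - untagged) / max(total, 1):.1f}%")
```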


Jesse: Yeah, I think that's one thing that's really important to think about because there's always going to be unattributable or untaggable spend in any AWS account, especially in a shared account where you have multiple teams deploying resources, multiple products sharing the same resources or sharing the same space. So, this is where we start talking about things that finance might want to know about in terms of how much is it actually costing us to run your different products? How much is it actually costing us to run these various different microservices of a given product?


Pete: Yeah. Like, services within a broader product, like service profitability within maybe a large SaaS product, knowing that level of insight, right?


Jesse: Absolutely. And then you can start thinking about not only what is it costing me, but then maybe you can start thinking about charging that back to the actual teams that are spending the money on those resources. And I get that a lot of people probably cringe when they hear me say that, so I don't want to say that that's an immediate first step that you should take, but start thinking about the budget of each component of your larger microservice or each component of your product if you've got a large SaaS product. Think about, how are you budgeting for future spend to better understand where can you spend more money? Where can you spend less money? Do you have the funds to hire another employee?


Maybe there's some optimization work that you want to focus on that will free up funds to hire another employee, or to look at other things. Start having that conversation with your finance team to better understand what are the things that are really important to them, and how can you help them answer their questions and reach their goals? And that's ultimately where tagging resources is a very, very important first step to look at all this spend from the perspective of the team or the product. But again, this is not the only step; you should also be looking at splitting up the spend into separate linked accounts.


Pete: Yeah, so most folks getting started on Amazon take their resources and put them into, maybe, a single account with multiple VPCs—those VPCs could be broken out by Dev and Prod, or maybe they're broken out by availability zone—or maybe you create an account for Prod and an account for Dev, and that's your separation. Some clients of ours, and some companies out there, are collecting accounts based on acquisition, so maybe different global business units have various accounts related to them. But if you are starting from scratch, if you are implementing something brand new, we actually recommend a much more intentional, planned approach. There are three key things to remember when you do this. First: within all of AWS, the account itself is really the only hard security boundary, all the way down to the root-level access that can be controlled with hard tokens.


That individual Amazon account is that hard security boundary. The next important thing is that accounts are free. It's free software; they're just giving it away. And you can create these very easily with AWS Organizations. And finally, tooling exists and is mature, both native and third-party, to make multi-account management very easy, very straightforward, in a way that just did not exist, God, even what, four to five years ago. I mean, it's truly amazing.
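Creating accounts really is a couple of API calls now. A minimal sketch, assuming Python and boto3, run from the management account (the email and account name are hypothetical placeholders):

```python
import boto3

org = boto3.client("organizations")

# Kick off creation of a new member account. AWS handles this
# asynchronously, so we get back a status ID to poll.
resp = org.create_account(
    Email="aws+product-one-prod@example.com",  # hypothetical
    AccountName="product-one-prod",            # hypothetical
)
status_id = resp["CreateAccountStatus"]["Id"]

status = org.describe_create_account_status(
    CreateAccountRequestId=status_id
)
print(status["CreateAccountStatus"]["State"])  # IN_PROGRESS/SUCCEEDED/FAILED
```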


Jesse: Yeah, that's one I especially want to call out, because AWS Organizations has come a very long way in terms of helping an organization manage multiple AWS accounts: not just creating them, but also applying standardized policies and security policies across these accounts, so that you can literally create cookiecutter AWS accounts that already adhere to the best practices that you, as an organization, need to follow.


Pete: Yeah. So, when you are creating your accounts, there's a series of accounts that, largely, everyone's going to need; how much you use each of these accounts really depends on your architecture or organization. But at the very least, you've got to create one account; you've got to get started. And this is your master—or what we would call your 'root'—account, moving away from the antiquated terminology of master/slave. It's the core account you create. This is your first one; this contains the AWS Organizations configuration.


The Master Payer Account lives here as well—that's Amazon's wording there. And this is, again, where all org-wide control policies live, and ideally, all of your savings plans and reserved instances would be managed within this account as well. That allows those savings, those commitments, to be spread across all accounts underneath it, and it ensures there's as little waste as possible when making those commitments. Go ahead, Jesse. Yeah.


Jesse: This is also a really great opportunity to give finance access to this root account in a very limited way so that they can look at billing information. So, they can look at Cost Explorer, for example, and see spend trends across all of your sub-accounts; they can see what spend looks like across all of your products, across all of your resources, and get a better understanding of the total AWS spend, broken down by product, or by microservice, or by team, or by some other business unit or entity that you decide.


Pete: Exactly. The next account that you want to create is essentially your security account. This would hold all your security and logging tooling, like your SIEM—or your 'SIM,' whichever way you want to pronounce it—and your vulnerability audit scanners. But most importantly, it holds your CloudTrail logs. So, CloudTrail is something that you should absolutely enable in every single account, but you want to have those logs go to this account. If there were only one thing your "Security"—air quotes—account was used for, it's the storage of your CloudTrail logs in a safe, immutable place, with a restricted group of people who have access to it. That's really important.


Over time, as your usage grows, maybe you're using Transit Gateway, for example, to connect all of these accounts so that they egress their network traffic through security. Maybe you're running various proxies within the security account. Again, it's a very restricted centralized place that, again, at the very least, no matter what you're doing, if you are deploying anything to Amazon, should contain just your CloudTrail logs.
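With AWS Organizations, a single organization-wide trail can cover every member account. A hedged sketch of the setup Pete describes, assuming Python and boto3, run from the management account; the bucket name is a hypothetical placeholder, and the bucket must already carry a policy granting CloudTrail write access:

```python
import boto3

cloudtrail = boto3.client("cloudtrail")

# One trail, all accounts, all regions, delivering into a bucket owned
# by the security account so member accounts cannot tamper with logs.
cloudtrail.create_trail(
    Name="org-trail",
    S3BucketName="example-security-cloudtrail-logs",  # hypothetical
    IsOrganizationTrail=True,
    IsMultiRegionTrail=True,
)
cloudtrail.start_logging(Name="org-trail")
```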


Jesse: And I think this also gets back to the point I made earlier about AWS Organizations and applying a cookiecutter template to all of the accounts that you create. After you've created this security account, you can apply that template to all of your other sub-accounts in such a way that they all automatically send this information back to the security account. And again, as Pete mentioned, only the security team will have access to this account, or only your specific SIEM group, your specific security group. So, if you treat security as a platform service, the security team doesn't need access to every single individual AWS account; as long as they have access to this central hub, they can see what's happening in all the other accounts. And they can have conversations with the teams deploying resources in those accounts when there's a potential vulnerability in a specific business unit account, team account, or environment account. If they look at the unit that account belongs to, or at the user-defined cost allocation tags associated with the resources that are part of the vulnerability, they have a way to find somebody to talk to and actually get the issue resolved.


Corey: This episode is sponsored by our friends at New Relic. If you’re like most environments, you probably have an incredibly complicated architecture, which means that monitoring it is going to take a dozen different tools. And then we get into the advanced stuff. We all have been there and know that pain, or will learn it shortly, and New Relic wants to change that. They’ve designed everything you need in one platform with pricing that’s simple and straightforward, and that means no more counting hosts. You also can get one user and a hundred gigabytes a month, totally free. To learn more, visit newrelic.com. Observability made simple.


Pete: Yeah, the next account could have a few different names: maybe it's your Tools account, maybe it's your Shared Services account, but this is your org-wide tooling—shared tooling, CI/CD systems, version control systems, config management systems, things like that. These would be the things that are required to support all the other services; even your monitoring infrastructure might live here. And these would be the things where—talking about cost allocation, just like the security account—you take the spend that's in here and, as one client of mine put it with the best analogy, you give it the peanut butter schmear across all of the other—


Jesse: [laugh].


Pete: —accounts. From a cost perspective, you just spread that peanut butter, that spend, across all the other accounts. So, if you had five other specific service accounts, you could divide this spend by five. And that's how you essentially show back or charge back that cost to those accounts. But most importantly, this is going to improve your security posture by having, again, a central place with all of the company assets: AMI repositories, container repositories, source control, everything like that.


You can lock this down, you can gate the level of access, you can peer with these other accounts—there's a lot of interesting stuff you can do. But largely, you're creating a place where you're capturing, bucketing, spend, even if it's not tagged, within this shared-services group, to be allocated at a later time.


Jesse: This is where I feel like I need a ‘plus-one emoji’ audio bite so that I could just add that. I don't have anything specific to add to here. And—


Pete: Oh yeah.


Jesse: Yeah, [laugh].


Pete: Like the old—is that from the 90s or whatever? Yeah. [laugh].


Jesse: Yeah, yeah, yeah, absolutely. And so, like, that—just a huge plus-one to making this account and put all of your tools together in a shared resource space.


Pete: Yeah, the last account that we'll talk about is essentially an identity account: user access control. You're really terminating your access control in this account, whether it's an SSO solution, a federation solution, or you're just creating individual user accounts or IAM roles that people authenticate against. Again, centralize it all in one place. You should not have user accounts existing in all of your accounts; IAM federation has largely solved that issue.


You can use the AWS SSO service, and there are third-party services that are great, like Okta and OneLogin. Again, if there's one thing you should not be doing, it's creating individual user accounts in your individual Amazon accounts. That will become essentially impossible to manage, so don't even start there. Start off with some type of federation or some type of Single Sign-On; it will make your lives so much better.
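Under the hood, all of these federation options boil down to STS role assumption. A minimal sketch, assuming Python and boto3; the account ID and role name are hypothetical:

```python
import boto3

sts = boto3.client("sts")

# From the identity account, assume a role in a member account rather
# than keeping IAM users in every account.
creds = sts.assume_role(
    RoleArn="arn:aws:iam::210987654321:role/EngineerAccess",  # hypothetical
    RoleSessionName="jesse",
)["Credentials"]

# Use the temporary credentials to act inside the member account.
member_ec2 = boto3.client(
    "ec2",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
print(len(member_ec2.describe_instances()["Reservations"]), "reservations")
```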


Jesse: And again, going back to the conversation about AWS Organizations and creating these cookiecutter accounts: one of the things you can add to the templated AWS accounts that you later create for your different brands, or teams, or products is all of this federated access, baked in when you create the account. So, automatically, you have all of the specific leverage points you need for federated access, without going into every account individually, creating a bunch of roles, and enabling a bunch of feature flags or services.


Pete: So, those are generally the core accounts you're going to create, not all-inclusive. There might be things that are specific to your business or organization. But, Jesse, what do we do now? Okay, we've got our core accounts. I've got my SaaS app here; where do I go with this?


Jesse: This is always the age-old question where, “What do I do now?” And the answer is my favorite answer of, “Well, it depends.” It really depends on what you ultimately want to do with your different business units. It depends on what your business units are. And those business units could be as broad as different products, or particularly different actual business units or parts of the organization that are using AWS in different ways, or it could be as small as different components of a single product, like different microservices or different teams that are leveraging AWS in different ways.


Start thinking about how does your organization look at the different components of your product or products, and start breaking that down. Is it a single brand with a single product? Is it multiple brands with multiple products? Start breaking down that tree, start breaking down that list into all the smaller components, and look at that as the list of accounts that you ultimately want to create. So, for example, you may have two products and you'll want to create different accounts for those two products, number one, but number two, you might also want to create different accounts for different deployment environments.


So, for example, you may want Product One to have a development account and a separate production account. You may want Product Two to have a separate development account and production account. So, start thinking about that kind of binary tree, almost, which I shudder to even bring up in this day and age. But start thinking about that kind of tree of different business units, and identities, and ways that you slice and dice the components of your engineering organization, and start thinking about that for your linked accounts.


Pete: Yeah, I mean, I think you could also take those—maybe they're brand accounts, or product accounts—and definitely don't get too granular here, because you could really, kind of, shoot yourself in the foot down the road. But you want to have some level of separation here to make it easy not only to attribute those costs but also to, kind of, separate those workloads. And maybe you say to yourself, "Well, I just really need to know that Prod spend is different than Dev spend." Great. Then that's the solution for your organization.


Again, none of these really preclude you from moving to something more segmented in the future; you could create, let's say, a new product initiative and deploy it into a brand-new account. Now that one brand-new product is in its own Amazon account for cost attribution reasons. I can't tell you how many times I've been asked, at certain stages of the startups I've worked at, "What is the cost of certain services to us?" Or, "How much is a certain client costing us?" And when you get these questions, then you need to start diving into, kind of, per-product spend.


But one thing that I think more businesses should be doing to answer that question is asking: what is the cost of these features? Being able to go to your product teams, or to arm your product teams with this information for products and features within an application, helps them drive their decisions. "Well, if this one feature is used by 1 percent of our clients but represents 30 percent of our spend," that should be a decision point: do we keep it? Do we not keep it? Do we refactor it? Do something with it, right? But if that spend is just locked up, what do you do? You don't even know that it's happening.


All right. Well, Jesse, I think we gave a good intro into how to create your accounts. What if I'm on Amazon? I mean, is it worthwhile for me to start pursuing some of these strategies, even if I have a ton of technical debt in my infrastructure?


Jesse: Yeah, I absolutely think it is. I think that there are very small steps that you can make over time that will absolutely help along this route, even if you have existing technical debt, even if you have other things that you are concerned about, even if you are in a quote-unquote, ‘brownfield space,’ there is still absolutely opportunities to start moving the needle on this work with cost attribution and cost allocation.


Pete: Yeah. So, don't think just because you've been on Amazon Web Services for a decade and you're kind of locked in your ways, that you can't improve what you have. You can create new accounts, you can start creating new accounts, and start migrating services into those accounts. Again, it's not going to happen overnight, but a concerted investment over a period of time will absolutely net benefits in the future.


So, do you have a question for Jesse and myself to try to answer, or at least stumble our way through, on a podcast episode? You can now ask us via our website at lastweekinaws.com/qa. And QA is not short for 'quality assurance,' as you might think; it's short for 'question and answer.' But because we are all former sysadmins and technical operators, we're pretty lazy and we don't like to type out very long things, so we've just shortened it to 'QA.' So, go to lastweekinaws.com/qa and enter your question. But more specifically, what kind of questions do we want to focus on, Jesse?


Jesse: Yeah, so obviously, this is a podcast about AWS, so the questions should be scoped to AWS. Beyond that, we try to focus on questions related to cost optimization and optimization of AWS services in general. I particularly have a very soft spot in my heart for a lot of the more qualitative conversations, but I think a lot of the content we cover on this podcast is more quantitative. So, any questions related to AWS architecture, things you might want us to cover in the future, or specific "How do I…" questions around cost allocation, cost attribution, or anything related to cost optimization—we are happy to answer.


Pete: Absolutely. So, head over there, fill out the form, ask us a question, and yeah, we'll do our best to answer it on one of our future podcasts. If you enjoyed this podcast, please go to lastweekinaws.com/review and give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review, give it a five-star rating on your podcast platform of choice, and tell us about the most impressive account structure at your business. Do you have just one account for everything? There's nothing wrong with that. Thanks again. Bye.


Announcer: This has been a HumblePod production. Stay humble.

Fri, 18 Dec 2020 03:00:00 -0800
Is ECS Deprecated?

Want to give your ears a break and read this as an article? You’re looking for this link.

Sponsors

Never miss an episode



Help the show



What's Corey up to?

Wed, 16 Dec 2020 03:00:00 -0800
SageMaker SageFactory
AWS Morning Brief for the week of December 14, 2020 with Corey Quinn.
Mon, 14 Dec 2020 03:00:00 -0800
The Kinesis Outage

Links

Transcript
Corey: This episode is sponsored in part by our friends at Linode. You might be familiar with Linode; they’ve been around for almost 20 years. They offer Cloud in a way that makes sense rather than a way that is actively ridiculous by trying to throw everything at a wall and see what sticks. Their pricing winds up being a lot more transparent—not to mention lower—their performance kicks the crap out of most other things in this space, and—my personal favorite—whenever you call them for support, you’ll get a human who’s empowered to fix whatever it is that’s giving you trouble. Visit linode.com/screaminginthecloud to learn more, and get $100 in credit to kick the tires. That’s linode.com/screaminginthecloud.


Pete: Hello, everyone. Welcome to the AWS Morning Brief. It's Pete Cheslock again—


Jesse: And Jesse DeRose.


Pete: We are back to talk about ‘The Kinesis Outage.’


Jesse: [singing] bom bom bum.


Pete: So, at this point, as you're listening to this, it's been a couple of weeks since the Kinesis outage has happened, and I'm sure there are many, many armchair sysadmins out there speculating at all the reasons why Amazon should not have had this outage. And guess what? You have two more system administrators here to armchair quarterback this as well.


Jesse: We are happy to discuss what happened, why it happened. I will try to put on my best announcer voice, but I think I normally fall more into the golf announcer voice than the football announcer voice, so I'm not really sure if that's going to play as well into our story here.


Pete: It's going, it's going, it's gone.


Jesse: It’s—and it's just down. It's down—


Pete: It's just—


Jesse: —and it's gone.


Pete: No, but seriously, we're not critiquing it. That is not the purpose of this talk today. We're not critiquing the outage because you should never critique other people's outages; never throw shade at another person's outage. That's not only crazy to do because you have no context into their world. It's just, it's not nice either, so just try to be nice out there.


Jesse: Yeah, nobody wants to get critiqued when their company has an outage and when they're under pressure to fix something. So, we're not here to do that. We don't want to point any fingers. We're not blaming anyone. We just want to talk about what happened because honestly, it's a fascinating, complex conversation.


Pete: It is so fascinating and honestly, loved the detail, a far cry from the early years of Amazon outages that were just, “We had a small percentage of instances have some issues.” This was very detailed. This gave out a lot of information. And the other thing too is that, when it comes to critiquing outages, you have to imagine that there are unlikely to be more than a handful of people even inside Amazon Web Services that fully understand the scope of the size and the interactions of all these different services. There may not even be a single person who truly understands how these dozens of services interact with each other.


I mean, it takes teams and teams of people working together to build these things and to have these understandings. So, that being said, let's dive in. The Wednesday before Thanksgiving, Kinesis decided to take off early. You know, long weekend coming up, right? But really, what happened was that an addition of capacity to Kinesis caused it to hit an operating system limit, causing an outage.


But interestingly enough—and what we'll talk about today—are the interesting downstream effects that occurred via CloudWatch, Cognito, even the status page and the Personal Health Dashboard. I mean, that's a really interesting contributing factor, or a correlating outage—I don't know the words here—but it's interesting to hear that both CloudWatch goes down and the Personal Health Dashboard goes down.


Jesse: That's when somebody from the product side says, “Oh, that's a feature, definitely not a bug.”


Pete: But the outage to CloudWatch then even affected some of the downstream services to CloudWatch—such as Lambda—which also included auto-scaling events. It even included EventBridge, which was impacted, and that even caused some ECS and EKS delays with provisioning new clusters and scaling of existing clusters.


Jesse: So, right out of the bat, I just want to say huge kudos to AWS for dogfooding all of their services within AWS itself: not just providing the services to its customers, but actually using Kinesis internally for other things like CloudWatch and Cognito. They called that out in the write-up and said, “Kinesis is leveraged for CloudWatch, and Cognito, and for other things, for various different use cases.” That's fantastic. That's definitely what you want from your service provider.


Pete: Yeah, I mean, it's a little amazing to hear, and also a little terrifying, that all of these services are built based on all of these other services. So, again, the complexity of the dependencies is pretty dramatic. But at the end of the day, it's still software underneath it; it's still humans. And I don't want to say that I am happy that Amazon had this outage at all, but watching a company of this stature, of this operational expertise, have an outage, it's kind of like watching the Masters when Tiger Woods duffs one into the water or something like that. It's just—it's a good reminder that—listen, we're all human, we're all working under largely the same constraints, and this stuff happens to everyone; no one is immune.


Jesse: And I think it's also a really great opportunity—after the write-up is released—to see how the Masters go about doing what they do. Because everybody at some point is going to have to troubleshoot some kind of technology problem, and we get to see firsthand from this, how they go about troubleshooting these technology problems.


Pete: Exactly. So, of course, one of the first things that I saw everywhere is everyone is, on mass, moving off of Amazon, right? They had an outage, so we're just going to turn off all our servers and just move over to GCP, or Azure, right?


Jesse: Because GCP is a hundred percent uptime. Azure is a hundred percent uptime. They're never going to have any kind of outages like this. Google would never do something to maybe turn off a service, or sunset something.


Pete: Yeah, exactly. So, with all the talk about hybrid-cloud and multi-cloud strategies, you've got to know there's a whole slew of people out there, probably some executive at some business, saying, "Well, we need to engineer for this type of durability, for this type of thing happening again." But could you even imagine the complexity of just the authentication systems that differ between the two clouds—like IAM in one, and whatever's in GCP? And then, if you've built for Kinesis on Amazon and you're also targeting Google's Pub/Sub, building for that interoperability—just from a technical perspective, I would love to see someone do that. And then please do a conference talk so I can listen to it, because that sounds technologically impressive.


Jesse: Absolutely. And full disclosure, both Pete and I and the folks at Duckbill Group have mixed feelings on a multi-cloud strategy, but the point that we want to, especially, stress here is that there are places where a multi-cloud strategy may be beneficial for your company, for a business use case, and we're not trying to say that's wrong. But running into an outage with AWS, running into an outage with the cloud provider that has the largest share of the industry isn't necessarily the right move. Don't just move because you ran into an outage. Move purposefully, or develop a multi-cloud strategy purposefully for a business use case, not because you don't want outages because let's be honest: outages are going to happen, no matter which service provider you use.


Pete: Yeah, exactly. So, let's dive into the details. So, back on Wednesday, November 25, Kinesis, like I said, decided to take off for the long weekend. The trigger for this event was a small addition of capacity, added at about 2:44 a.m. PST, and it took about an hour for that to complete.


This was specifically the frontend systems that handle authentication, throttling, and request routing. And again, definitely read through the whole outline of the outage, because it gives tremendous detail about these frontend systems, why it takes so long for them to come on board—really, all of the complexity involved here. It's really fascinating. On adding that capacity, they explained that servers that are operating members of the fleet have to learn of the new servers joining and establish threads to those other systems, and they mentioned it would take up to an hour for existing frontend members of the fleet to learn of these new participants.


So, about an hour and a half after bringing that capacity online, they started getting alerts from Kinesis, and they thought—like many would think—it was likely related to the new capacity, but they were unsure because some of the errors that they were seeing just didn't correlate to that. But they still decided to start removing the new capacity anyway, right? That's a pretty logical first step right, Jesse? Undo the thing I did when you start getting alerts?


Jesse: Absolutely. And I think that is the logical first step. And I think it's even more important to point out that those alarms started an hour after they deployed the services and that's really, really tough because when you deploy something, you want to know immediately if it fails. The fact that they didn't start seeing alerts until an hour later, gives any engineer on call that kind of sinking feeling of dread of, “Well, I thought everything was good to go, and I went back to sleep after running five or ten minutes worth of tests or looking at the data. But clearly, it's not, and now I need to dive into this more deeply.”


Pete: Yeah, so about two and a half hours after those alerts went off, they narrowed things down and believed that a full restart of those frontend systems would be involved. Now, Amazon does a really good job of explaining why this is—again, go read the full outline; we're just summarizing here. And I think anyone who's ever run any sort of large distributed database at scale knows that adding and removing capacity, or just restarting in general, can be really challenging and time-consuming, because you have to check and ensure consistency along the way. They even pointed out that they were worried about systems being overloaded and, because they were overloaded, being marked as unhealthy and removed from the pool as well. So, that's a really interesting caveat when they talk about what it would take to actually resolve this issue.


Jesse: Yeah, my heart goes out to the engineers who were diagnosing this outage, because diagnosing an outage is stressful enough, but diagnosing an outage with multiple potential influencing causes and different metrics and alerts—your brain is already working so, so hard to keep up, and it's not great for your mental health. This is probably why we see a lot of burnout: there are a lot of different potential influencing causes for this kind of outage, and when you're running any kind of distributed database at scale, it's really tough to clearly nail down the one thing that caused a service outage. I also really want to quickly call out that AWS mentioned this process of restarting the frontend servers was a long and careful process. And I have to admit, I always cringe a little bit whenever an organization says that a technical workflow is going to be a long and careful process, because it points out a system that is extremely susceptible to human error or negative environmental forces. That's a big business risk.


And it doesn't mean that there's anything in the process that is wrong, or incorrect, but it points out a great opportunity for improvement. It points to a place where maybe more testing needs to happen. Or maybe this kind of process needs to be broken down into smaller, more manageable processes that have been tested and can be either automated or can be tested on a more regular basis to make sure that when this type of issue comes up again, it's able to be handled much more quickly and efficiently.


Pete: Yeah, I mean, I've been on the brunt side of the database-restarting game, adding capacity to systems, and it just pushes something over the limit. You're not sure what, but—I mean, reading through this has reminded me of so many outages that I've had to deal with. Distributed databases are hard. Distributed databases at Amazon scale are next-level hard. I mean, you're dealing with edge cases that most people are unlikely to ever see.


Jesse: And again, this is why my heart goes out to all the engineers who worked on this, not just managing these systems day to day in general, but who were part of troubleshooting and managing this outage. That's a lot of work.


Pete: So, about half an hour after where we last stopped—so this is now four and a half hours after the initial alarms fired—they identified the contributing factor. I'm not going to say 'root cause'; they say 'root cause' enough for everyone in that document. They found that when adding the new systems, they hit a thread limit on the operating system. It was one of those classic Linux limits that have been set historically low for decades. I can't tell you how many times I've hit random Linux thread limits, and file count limits, and socket limits, and—I mean, it's just—it's—


Jesse: It's annoying.


Pete: It's annoying, yeah. And one thing I really want to call out is that when they talked about these limits, adding those systems—because each one has to talk to the others—increased the number of threads, the number of sockets, the number of open network connections; that makes a ton of sense hearing them explain it. But again, think about just how—not simple, but—how simple a problem this was: it was just this largely artificial limit set by the operating system from who knows when, long ago.
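If you want to see where your own boxes stand on these limits, here is a quick Linux-only sketch in Python (values vary by distro and configuration):

```python
import resource

# Per-process soft/hard limits of the kind that bit Kinesis's frontends.
for label, limit in [
    ("open files (RLIMIT_NOFILE)", resource.RLIMIT_NOFILE),
    ("processes/threads (RLIMIT_NPROC)", resource.RLIMIT_NPROC),
]:
    soft, hard = resource.getrlimit(limit)
    print(f"{label}: soft={soft} hard={hard}")

# System-wide ceiling on threads across all processes.
with open("/proc/sys/kernel/threads-max") as f:
    print("kernel.threads-max:", f.read().strip())
```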


Jesse: Yeah, and the fact that they found this particular contributing factor, four and a half hours after the first alarm went off. That's a huge shout-out to those engineers who were heads down, doing exploratory work for that long. And similar to Pete, I've been on the receiving end of this where you find the contributing factor—or one of the contributing factors—and you think to yourself, “Oh, thank God. Now I know what went wrong.” But it takes time to get there, and with these engineers, who were looking at multiple different streams of metrics, and alerts, and errors, to be able to find something, four and a half hours later, I know there's a lot of HugOps going around on Twitter when this is all happening, but I just want to plus one that because huge, huge props to the people who were focused on this for that entire period of time. That is a lot of time for your brain to be under this cognitive load, to be stressed out, trying to resolve this outage.


Corey: This episode is sponsored in part by ChaosSearch. Now their name isn’t in all caps, so they’re definitely worth talking to. What is ChaosSearch? A scalable log analysis service that lets you add new workloads in minutes, not days or weeks. Click. Boom. Done. ChaosSearch is for you if you’re trying to get a handle on processing multiple terabytes, or more, of log and event data per day, at a disruptive price. One more thing, for those of you that have been down this path of disappointment before, ChaosSearch is a fully managed solution that isn’t playing marketing games when they say “fully managed.” The data lives within your S3 buckets, and that’s really all you have to care about. No managing of servers, but also no data movement. Check them out at chaossearch.io and tell them Corey sent you. Watch for the wince when you say my name. That’s chaossearch.io.


Pete: Now, when they identified that they were hitting this limit, they removed those extra systems, because they didn't want to just blindly increase that limit without understanding the impact. And that was a phenomenally smart move. I have been at too many places where we hit an artificial limit and just fire off the command that increases it to some unreasonably, impossibly high number. Because that's totally not setting a ticking time bomb for your future.


Jesse: [laugh]. Yeah. What could go wrong? What could possibly go wrong?


Pete: [laugh]. As it turns out, like, yeah sure, you won't hit the filesystem limit anymore, but then you’re going to have some subtle memory leak. It'll fail in just some new and interesting way.


Jesse: “That's Future Me's problem.”


Pete: Yeah, exactly. Or if you're like most people, and you bounce from job every two years, it's truly somebody else’s [00:17:44 crosstalk]. [laugh]. You're like two companies away at that point.


Jesse: Oh, god, it's so true.


Pete: Oh, I’m not saying that from experience or anything. So, long term, one of the solutions here is they're actually going to move Kinesis to fewer, yet larger hosts, so that they can run less, which solves the scaling challenge. And again, I really like this solution as well because they know how to operate Kinesis at a certain server count size. They know how the discovery happens, how the frontend systems talk to each other at a certain size. By reducing those number of systems to larger hosts, they kind of give themselves an ability to scale further because they know how to scale to a certain server count if that's their key point.


By reducing that number down and using those bigger hosts, they can always scale back up again knowing where their limits are, by, again, not increasing that artificial limit on the Linux operating system—assuming they use Linux there; it's presumably Amazon Linux. By not increasing that limit, they're not introducing a new unknown variable in how the system will react. They can leave the limit in place and just change to a configuration that they know behaves well. So, it really limits the unknown consequences of changing that limit.


Jesse: Yeah, I think this is a really great way to look at it because they are able to see that there are multiple different levers that they could pull and manipulate in order to resolve this problem, but rather than tweaking a lever that is potentially going to open up a bunch of new problems down the line, they are specifically saying, “No, we're not going to touch that. We’re going to keep these OS limits in place, and we're specifically just going to move to systems that allow us to run more threads concurrently.” Which I think is a really great way to look at this.


Pete: So, there were some bugs that were found along the way. Obviously, it wasn't just Kinesis that had this problem; we mentioned this before. The first thing that was mentioned in the outline was a bug that surfaced in Cognito. Cognito uses Kinesis for analyzing usage patterns and access patterns for their customers, and it was having issues because it could not send that data off to Kinesis. But there were also issues with CloudWatch metrics: they were being buffered locally in various services, or just dropped entirely.


And that then causes anything that's dependent on those metrics to no longer work. And that's potentially a pretty huge issue. Take Auto Scaling: if you had an Auto Scaling event based on metrics that never arrived, that could have caused, and very likely did cause, many of the outages for consumers of these services.


Jesse: And this is part of what makes this outage so fascinating to me, because we are talking about a very complex system here with multiple moving parts. Multiple services were involved, not just services from the AWS perspective, but services within the different systems of Kinesis itself. And one of the most important things is graceful degradation of these services, so that in the future we don't get hit by these issues as hard. So, maybe in the future, Cognito is able to continue to operate even when it's seeing errors from the Kinesis API, or these other AWS services are able to keep functioning at some degraded level even when they're seeing errors from upstream services that they depend on. And that's really important, because it's one of the things that showed up here both as bugs that were highlighted and as future improvements worth calling out: really great ways to think about making this better, not just in terms of preventing this from happening again, but in terms of minimizing this kind of impact in the future.
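
To make that graceful-degradation idea concrete, here's a hypothetical boto3 sketch of the pattern Jesse is describing: try to ship a record to Kinesis, and fall back to a bounded local buffer instead of failing hard. The stream name and buffer size are made up for illustration; this is a sketch of the pattern, not how any AWS service actually implements it.

```python
# Hypothetical degraded-mode producer: on Kinesis errors, keep the newest
# records in a bounded buffer and retry later, instead of crashing.
import collections
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

kinesis = boto3.client("kinesis")
local_buffer = collections.deque(maxlen=10_000)  # bounded: no runaway memory growth

def ship_record(data: bytes, partition_key: str) -> None:
    try:
        kinesis.put_record(
            StreamName="example-metrics-stream",  # hypothetical stream name
            Data=data,
            PartitionKey=partition_key,
        )
    except (ClientError, EndpointConnectionError):
        # Degraded mode: buffer locally rather than taking the whole
        # service down along with its dependency.
        local_buffer.append((data, partition_key))
```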


Pete: Exactly. And in some cases, this buffering of metrics that some services had, like Lambda, actually caused memory contention until engineers identified and resolved it. In some cases, they added additional buffers; they even mentioned adding three hours' worth of storage into CloudWatch's local metric store, which would then allow services like Auto Scaling to keep operating. There was one change that they made which, again, I kind of laugh at, just because it's so real.


Again, you want to think Amazon is on this whole other level, and in scale they are, but they're the same humans as we are, doing the same type of work. And the change they made was to migrate CloudWatch onto a separate, partitioned frontend fleet, which is just incredibly common and oftentimes the inevitable result of an outage. Take the most critical thing off of the, quote, "shared cluster" and move it somewhere a little bit separate. I can't tell you how many times I've had outages where the answer was: move that really noisy client off of our Elasticsearch cluster and give them their own.


Jesse: Yeah. If they are going to be super noisy, let them have their own space to be noisy so that they're not impacting everybody else who needs the same services. If there's one client who is going to be especially noisy, or needy, or compute-intensive, you put them in their own cluster, and maybe give them more compute resources, so that they're ultimately able to do what they need to do without impacting everybody else.


Pete: Exactly. So, on to the summary. Obviously, we both have our hot takes, and we'll greet you with those now. At a high level, as always: more monitoring, more alerting; these are always needed. It's super hard to know what to monitor in advance. The greater the observability you have in your environment, the more insight you have into what's happening, with that data stored somewhere accessible (of course, if CloudWatch goes down, then maybe you have some problems there). Even though you may not know what to monitor, monitoring as much as you are financially and technologically able to means the data is there for answering the unknown-unknowns. That's a common topic in the observability world: trying to find those unknown-unknowns, those outliers, to get a quicker answer and a quicker resolution to those problems.


Jesse: Yeah, I think that unknown-unknowns are extremely important to think about, especially in observability, as you mentioned, Pete. If I could go back and teach my younger self anything, I would say, "Just be mindful that there are going to be unknown-unknowns." And I think being mindful of that is critical. There are definitely folks in the monitoring space who believe that you need to monitor everything and have all the metrics so that you always have the data you need, but I think it's less about that and more about understanding what you are aware of, understanding that there are things you aren't aware of that could potentially come up and bite you in the butt, and having contingency plans for that.


Pete: Yeah, exactly. Wow. Well, this was a fascinating post-mortem outline that Amazon wrote up. I highly recommend that you all read through it. I think it's just great to see this level of detail. Outages are painful for everyone, but the amount of detail they gave really explains the world they were operating and debugging within, and I thought it was incredibly fascinating to get that look, kind of, behind the curtain.


Jesse: Yeah, we'll throw the link to the outage right up in the [00:25:16 show notes], but I also wanted to highlight an [article by Ryan Frantz], who talked about this outage through the lens of Donella Meadows' systems thinking and practice. Kinesis is a really complex system in its own right. Even if this outage didn't impact any other systems, even if it was just Kinesis itself that was experiencing problems, the retrospective of just Kinesis having these problems is a fantastic example of complex systems failing. But then, when you add in all of these other strands to the web that make the system even larger and more complex (you have not just the microservices within Kinesis, but now other AWS services that rely on Kinesis), you've got lots of other moving parts to worry about and coordinate. And it's not just about the contributing factors or the quote-unquote, "root cause," but about how all of these different components in the larger system can still function in some kind of degraded mode when the services they rely on are unavailable. How can we keep the entire service web, so to speak, available and online, even when some components of the service web might be weaker, or some might be gone altogether?


Pete: Yeah, exactly. All of this just speaks to the level of complexity that we operate within, which has been growing at an incredible rate over the past many decades. I mean, things are just so much more complex, and especially with the rise of microservices, it gets harder and harder to identify dependencies. You know, you see those Death Star graphs as well. It's crazy.


Awesome. I think that does it for us. If you have enjoyed this podcast, please go to lastweekinaws.com/review and give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review, give it a five-star rating on your podcast platform of choice and tell us what your most painful outage was. Thanks again.


Announcer: This has been a HumblePod production. Stay humble.

Fri, 11 Dec 2020 03:00:00 -0800
The Google Disease Afflicting AWS

Want to give your ears a break and read this as an article? You’re looking for this link.

Sponsors



Never miss an episode



Help the show



What's Corey up to?

Wed, 09 Dec 2020 03:00:00 -0800
Hit by the Conference Trainium
AWS Morning Brief for the week of December 7, 2020 with Corey Quinn.
Mon, 07 Dec 2020 03:00:00 -0800
AWS S3 Storage Lens: The Best Service Not Announced at AWS Storage Day

Links

  • Follow Last Week In AWS on Twitter

Transcript
Corey: This episode is sponsored by ExtraHop. ExtraHop provides threat detection and response for the Enterprise (not the starship). On-prem security doesn’t translate well to cloud or multi-cloud environments, and that’s not even counting IoT. ExtraHop automatically discovers everything inside the perimeter, including your cloud workloads and IoT devices, detects these threats up to 35 percent faster, and helps you act immediately. Ask for a free trial of detection and response for AWS today at extrahop.com/trial.


Pete: Hello, welcome to AWS Morning Brief. I am Pete Cheslock, and I am here yet again with Jesse DeRose.


Jesse: Hello.


Pete: We're here to talk about the best service announced not during AWS Storage Day 2020.


Jesse: So, close.


Pete: So, close, though. It was announced a few days after, and that is the AWS S3 Storage Lens service, which I think I've got that naming right. I know sometimes it's ‘AWS thing,’ sometimes it's ‘Amazon thing,’ and to be honest, I never know which is which.


Jesse: Yeah.


Pete: AWS S3 Storage Lens is honestly one of the best new services that I've seen released thus far. I guess we're still pre-re:Invent announcements and a lot of this stuff. But what it is, from their site: "S3 Storage Lens delivers organization-wide visibility into object storage usage, activity trends," blah, blah, blah, blah, blah, marketing speak. Basically, it allows you to get a view of your S3 usage across accounts. Which, that's mindblowing, right?


Jesse: Yeah. This feature has so much potential; I'm really excited to see where they go with it.


Pete: Yeah. And so when I first saw this blog post on Amazon's site talking about it, my mind just started going crazy, because, again, we work at The Duckbill Group as cloud economists with a lot of different clients. AWS Organizations made it very easy to spin up new accounts, and the design principle of creating many Amazon accounts to segment your workloads, whether for cost reasons or security reasons, has caused a lot of our clients to have lots of Amazon accounts. I mean, you could see hundreds of Amazon accounts, in some cases.


And the issue that I've always kind of had, and especially an issue we deal with in helping our clients analyze and optimize their costs, is: how do you aggregate S3 usage? Because S3 is normally in the top five services that we see by usage, how do you pull that together? We do that a lot of different ways. Jesse, maybe you can chat a little bit about some of the ways that we currently try to analyze this spend?


Jesse: Yeah. Pete, I think I'm really excited about this feature because AWS already offers aggregate looks at metrics for other top services by spend. Like, for EC2, you've got Compute Optimizer. We don't have anything for RDS yet, but I feel like that might be not far off, given Compute Optimizer’s existence. And we already have other tools that allow you to look across multiple accounts to look at metrics, especially if you're looking at Cost Explorer, for example, you can see metrics across multiple accounts, you can see spend across multiple accounts.


So, I feel like this makes sense. I'm really excited to see that you can look at all of your S3 storage metrics in one place because right now, the only way that we're able to get any kind of representation of S3 usage is through Cost Explorer. And there are ways that you can go about filtering and slicing that data to get usage information and certain metrics, slicing and dicing on different filters for accounts and cost allocation tags, but it's all at the bucket level, or at the usage level, and if you really want to dig in deeper, you don't have a lot of options.
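
For reference, the kind of slicing Jesse is describing looks something like this with the Cost Explorer API: a minimal boto3 sketch, with placeholder dates, grouping S3 spend by linked account.

```python
# Minimal sketch: S3 spend per linked account for one month,
# via the Cost Explorer API. Dates are placeholders.
import boto3

ce = boto3.client("ce")
response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2020-11-01", "End": "2020-12-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    Filter={
        "Dimensions": {
            "Key": "SERVICE",
            "Values": ["Amazon Simple Storage Service"],
        }
    },
    GroupBy=[{"Type": "DIMENSION", "Key": "LINKED_ACCOUNT"}],
)

# One row per linked account, with that month's S3 cost.
for group in response["ResultsByTime"][0]["Groups"]:
    account_id = group["Keys"][0]
    amount = group["Metrics"]["UnblendedCost"]["Amount"]
    print(account_id, amount)
```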


Pete: Yeah, it's a service that they're operating on your behalf, so your only insight is whatever they give you insight into. Maybe some of that is CloudWatch metrics; there's obviously S3 storage analytics, which can give you some idea of your storage based on access patterns and help you optimize. But the ability to see it across multiple accounts is, I think, really the big game-changer.


Jesse: And I think what's really amazing here is that the majority of metrics that they're offering are free. And we'll get into that in a minute, but I'm really impressed that so many of these metrics are shared free of charge. You just have to turn it on. And then you have access to all of this great information that you can work with.


Pete: Yeah. I think that's a great point that we haven't mentioned yet: the basic form of this is free. And the metrics that you can get in the free tier are pretty useful. Also, this is actually something that is turned on in your account right now. If you have an Amazon account, go into S3; it'll be in the left-hand column (at least it should be, unless they move stuff around), and you'll see a drop-down for Storage Lens with an option for dashboards.


And when you go into the dashboards, there will be a default dashboard already pre-configured with the free metrics enabled for your account. Now, that can be super helpful if, let's say, you just have one account: you get some really good high-level metrics around your storage by bucket. You can go into that dashboard and really quickly see total storage across all your buckets. You can see trend analysis, with day-by-day and week-by-week change comparisons, showing how things are growing. One thing I saw really blew me away, because it's something we deal with a lot: they have broken the metrics out into a kind of high-level summary, focusing on data protection (being able to see the percentage of data replicated or encrypted) but also on cost efficiency, like being able to see if you have versioning enabled, since obviously there's a cost for that.


How many old versions of this thing do you have? But also: incomplete multipart uploads. That is a potentially large and, in many ways, super hidden cost for some users of Amazon S3. If you are uploading a multipart file and it fails, it lives in this purgatory, storage purgatory, where you're charged for it, but you may not see it in an obvious way.


Jesse: And we see that with a lot of our clients who have multipart uploads and end up with these incomplete multipart uploads that just take up space. Prior to Storage Lens, there were no clear metrics that said: here's all of this stale multipart upload usage that you're paying for, which is effectively just wasted space. But now we have metrics for that; now we have information that clearly tells us where they are and how much space they're taking, and you can actually do something about it.
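
Once Storage Lens points you at the stale uploads, the usual fix is a lifecycle rule that aborts incomplete multipart uploads after a few days. A minimal boto3 sketch, with a placeholder bucket name and day count:

```python
# Abort incomplete multipart uploads a week after they start, so the
# hidden storage charges stop accruing.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="example-bucket",  # placeholder
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "abort-stale-multipart-uploads",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    },
)
```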


Pete: Right. Yeah, it gives you intelligence that you can act upon. To talk about those metrics, since we're on that subject: when I went into that default dashboard, I obviously started looking through what kinds of metrics there are. There are a whole lot of them, and they're covered in the Amazon documentation, which is linked from Storage Lens in S3, but you can see things such as average object size, object count, and total storage. Those can be helpful depending on whether, say, you want to see where you're spending and in which buckets.


Maybe you know your top-level spend, but you want to know how much is in certain buckets. You can see current-version storage and what percentage of your storage it represents: how much data is in old versions, which can really stack up the charges. Like I mentioned, multipart uploads. And then you can even dive into things around replication, like how much of your data is replicated or replicating, and encryption as well. Like—


Jesse: Yeah, that one I'm really excited about because—


Pete: Yeah. Like, what percentage of your data is encrypted?


Jesse: Yeah. I feel like this is something that so many companies, or especially security teams, harp on: getting all of their data encrypted end-to-end, everywhere in their application ecosystem. So, when you can see at a glance that 80 percent of your buckets are encrypted, or 80 percent of your S3 objects are encrypted, you have a clear picture of how much of your data is protected the way you expect it to be protected and how much further you have to go, and it's all in one pane of glass, essentially. As much as I hate to use that phrase, it really is this clean dashboard that gives you all this information at a glance.


Pete: Yeah, exactly. And when I logged in, checked out the default dashboard, I was, like, thinking—and this is maybe just a confusion on my part. Also, I didn't read the directions. Classic move; I just went right in and clicked on it—it does call itself default account dashboard, not default organizational dashboard.


Jesse: Yeah.


Pete: And so when I clicked that, I was like, wow, this storage is kind of small, because obviously, it was just for this one account. So, what I did is say: let's go create an all-Duckbill-accounts view. I wanted a dashboard for all of those, and to do that, you actually need to go in and enable an AWS Organizations setting that authorizes S3 Storage Lens to access the organization, so that you can create those organization-level dashboards. That was amazingly easy to do: you click a box, you tick a thing, and hit save. You're not dealing with IAM. I didn't go into IAM once for any of this. So, kudos on—


Jesse: Which is huge.


Pete: Kudos on making that setup super easy. And so I went to go create a dashboard for all my buckets and gave it a name. Again, actually read things through as you go create these; I didn't read anything, and I ran into some issues. One of them is that there's a region for your dashboard. And that's important because if you create a dashboard in one region and then want to dump data to a separate bucket, it actually told us that you need to create the bucket in the same region as the dashboard.
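
If you'd rather script that one-click Organizations setting than click it, it's a single trusted-access call from the management account. A sketch; the service principal string here is our best recollection, so verify it against the current Storage Lens documentation.

```python
# Allow S3 Storage Lens to see the organization, from the management account.
import boto3

orgs = boto3.client("organizations")
orgs.enable_aws_service_access(
    ServicePrincipal="storage-lens.s3.amazonaws.com"  # verify against the docs
)
```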


So, that's one of the cool features of Storage Lens as well: the ability to output the metrics it has for you into S3. Now you can consume this into whatever you're using, like your other monitoring services, and I'm sure there's going to be a variety of third-party integrations for this kind of data. As you create the dashboard, you can limit it to all of the accounts or certain accounts, including or excluding certain things. But then you get into the section on metrics collection.


And there are the free metrics, which are the default. But you can also enable the advanced metrics and recommendations. There is a price for this; it's not free. And interestingly enough, it's actually twice as expensive as the current pricing for S3 storage analytics, I believe; I believe it's 20 cents per million objects monitored. Now, not a lot of people may know how many objects they have, but here's the beauty: now you do.


You can turn on the free metrics, figure out how much you have, and actually get an accurate cost idea before you turn it on. That's pretty awesome and rare in the Amazon world.
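
And if clicking through the console isn't your thing, dashboards can also be created through the S3 Control API. A hedged boto3 sketch of an organization-level dashboard with advanced metrics and a Parquet export: every name, ARN, account ID, and region below is a placeholder, and the exact configuration shape is from memory, so double-check it against the current boto3 s3control documentation before use.

```python
# Hedged sketch: create an org-wide Storage Lens dashboard with advanced
# (activity) metrics and a Parquet export. All identifiers are placeholders.
import boto3

s3control = boto3.client("s3control", region_name="us-east-1")  # placeholder region
s3control.put_storage_lens_configuration(
    ConfigId="org-wide-dashboard",
    AccountId="111111111111",  # management account (placeholder)
    StorageLensConfiguration={
        "Id": "org-wide-dashboard",
        "IsEnabled": True,
        "AccountLevel": {
            "ActivityMetrics": {"IsEnabled": True},  # the paid, advanced tier
            "BucketLevel": {"ActivityMetrics": {"IsEnabled": True}},
        },
        "AwsOrg": {
            "Arn": "arn:aws:organizations::111111111111:organization/o-example"
        },
        "DataExport": {
            "S3BucketDestination": {
                "Format": "Parquet",
                "OutputSchemaVersion": "V_1",
                "AccountId": "111111111111",
                # Destination bucket must live in the same region as the dashboard.
                "Arn": "arn:aws:s3:::example-storage-lens-exports",
            }
        },
    },
)
```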


Jesse: Yeah. One other thing that I do want to call out is that this feature is already enabled in your individual accounts. But if you do want to turn it on for, let's say, the entire organization or some subset of your accounts, once you turn it on and it starts gathering metrics, it will only gather metrics across whatever subset of accounts or buckets you give it from that point forward. So, effectively, when we turned it on, it started giving us metrics across all of our linked accounts at that time, but wouldn't go any further back. It's similar to S3 Analytics, where you turn it on and then it starts building recommendations from your usage patterns over the first 30, or 60, or 90 days after you've turned it on. Similar case here: you will only see metrics across multiple accounts, or across an entire organization, once you turn it on and effectively tell AWS that you want to gather all of that data in one place. It won't automatically have all of that historical data stored for you.


Corey: This episode is sponsored in part by ChaosSearch. Now their name isn’t in all caps, so they’re definitely worth talking to. What is ChaosSearch? A scalable log analysis service that lets you add new workloads in minutes, not days or weeks. Click. Boom. Done. ChaosSearch is for you if you’re trying to get a handle on processing multiple terabytes, or more, of log and event data per day, at a disruptive price. One more thing, for those of you that have been down this path of disappointment before, ChaosSearch is a fully managed solution that isn’t playing marketing games when they say “fully managed.” The data lives within your S3 buckets, and that’s really all you have to care about. No managing of servers, but also no data movement. Check them out at chaossearch.io and tell them Corey sent you. Watch for the wince when you say my name. That’s chaossearch.io.


Pete: And what do you get for your 20 cents per million objects monitored? You get a lot of activity metrics: get requests, put requests, lists, posts, deletes, et cetera. If you're using S3 Select, you'll be able to see details around Select requests, bytes scanned, and bytes downloaded and uploaded. All kinds of things like that; super helpful. One thing it doesn't have, though, and I'm hoping these services get merged because, honestly, in Storage Lens I want the cross-account view of my storage analytics data.


So, I want the view of how often files are being accessed in this same place. And I really hope they incorporate it. And to be honest, if it's 10 cents for just Analytics but 20 cents for Storage Lens, I would pay more for Storage Lens if it gave me that insight, the storage class analytics, because then I could optimize not only for requests, maybe identifying where I need CloudFront in front of one of my buckets, but for storage tiering as well.


Jesse: Yeah, absolutely. That's something I think is going to be really impactful, and it's a really great use case for the advanced metrics. Ultimately, I think most people can get away with just the free tier of metrics that's available today free of charge; you don't need to enable the advanced metrics. But if you really want to go the extra mile and start looking at how you can optimize your applications' ability to read and write data in S3, those advanced metrics will absolutely help.


Pete: But here's the beauty. Turn on the free mode (well, it's all really on by default), and if you want to turn on cross-account, figure out how many objects you have that you want to monitor and get those insights on, and turn it on for just those specific buckets or items.


Jesse: Absolutely.


Pete: And because you're going to know how many objects you have, you'll know the cost impact in advance. You also don't need to have it on all the time. You can turn it on for a month; you can turn it on for a period of time, get the insight you need, make the recommendations, and move on. One very critical point, though, that I will call out here because we obsess about Amazon pricing and bills. Is that a valid assessment, Jesse?


Jesse: Yeah. I think that's an understatement.


Pete: To the point where we spend more time than really any human should reading Amazon documentation around pricing because every time something new comes out, we obviously want to know how it's priced because people are going to ask us how it's priced and we want to have that answer. So, this is, again, it's 20 cents per million objects monitored per month; straightforward. And there's a pricing model that already exists for this with Storage Analytics, S3 Analytics for Storage Class, so we understand that model. But in the pricing guide on S3, there is a line that I called out here, which is interesting because you can enable or disable a dashboard, which I thought was weird.


Jesse: Yeah.


Pete: Why is that there? And I now know why it's there. And this is the line; it says, "For S3 Storage Lens advanced metrics and recommendations, you will be charged object monitoring fees for each Storage Lens dashboard used. The Storage Lens advanced metrics and recommendations pricing includes all the stuff that you get: 15-month data retention, activity metrics, et cetera."


What that means is, if you create a dashboard to monitor all of your accounts and all of your buckets, and you turn on advanced metrics, you will be charged 20 cents per million objects monitored. If you create a second dashboard doing the exact same thing, you will get charged an additional 20 cents per million objects. Your price will literally double. And they will keep doing that: you will keep getting charged 20 cents per million objects monitored per month, per dashboard.
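
The arithmetic of that footgun is worth spelling out; a quick sketch with made-up object counts:

```python
# Back-of-the-envelope math: the advanced-metrics charge applies per
# dashboard, so duplicate dashboards double the bill.
RATE_PER_MILLION = 0.20  # USD per million objects monitored, per month

def monthly_cost(objects: int, dashboards: int) -> float:
    return RATE_PER_MILLION * (objects / 1_000_000) * dashboards

print(monthly_cost(500_000_000, 1))  # 500M objects, one dashboard:  $100.00
print(monthly_cost(500_000_000, 2))  # same objects, two dashboards: $200.00
```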


Jesse: This feels like a very typical AWS move where they announce something really, really awesome, really, really cool, really, really exciting, but the pricing and the documentation doesn't quite clearly highlight those very sharp edge cases.


Pete: And we see it a lot with other services: people have no idea why they're getting charged in certain ways, and it's simply because the pricing is specific to something like this, in this case every dashboard you create. Also, if I create an all-account dashboard in my top-level, kind of, master payer account, let's say, and other people create dashboards in their lower-level accounts aggregating that same data, again, you're going to get those additional charges. And so that's definitely something to keep in mind. It's a rough edge there, and it's something that you'll want to watch for. Maybe they'll create S3 Storage Analytics Dashboard Systems Manager Cost Manager for us later. I don't know.


Jesse: [laugh]. God Almighty, help us.


Pete: Don't actually do that Amazon; that was a joke. Don't create that service. But what's interesting is when we did create this dashboard, I was like, “Cool. I want to go look at it.” And we got an Amazon Detective pulled on us here, Jesse. What happened?


Jesse: Yeah, so as soon as we enabled this dashboard, we clicked into the dashboard to look at it, and it said, “Thanks for enabling me. I have to do some stuff behind the scenes. Come back in 48 hours.”


Pete: I mean, you can't be that upset about it, but it is still funny. At least Detective was like, “Come back in 14 days,” and we were like, “Okay.” [laugh].


Jesse: Well, yeah. And it was worse for AWS Detective (or, excuse me, Amazon Detective) because you were effectively paying for those 14 days, or you were in a trial period for those 14 days, during which you effectively couldn't do anything.


Pete: Yeah, exactly. Now, there was one more great feature, the chef's-kiss feature: when you create these metrics, whether they're free or paid, you can have them exported to an S3 bucket in CSV or Parquet format. And again, shout out for more Parquet storage, because these metrics could potentially be pretty sizable if you have a lot of data, just like everything else. But also, if you have it in Parquet format on S3, you can immediately query that stuff with Athena in a super-easy way. And if that's just a little too advanced (which, I get it; it's not the easiest to use), CSV is very flexible as well. And I think it's, again, great that they're giving you insight into this data and then giving you the data itself to do whatever you want with, whether that's consuming it into your third-party metrics system or into your own tooling.
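
As a sketch of what that Athena path looks like: once you've created a table over the export (using the schema from the Storage Lens documentation), a query is one API call away. The database, table, and column names below are hypothetical placeholders.

```python
# Run a query against a (hypothetical) table built over the Parquet export.
import boto3

athena = boto3.client("athena")
athena.start_query_execution(
    QueryString="""
        SELECT bucket_name, metric_name, value
        FROM storage_lens_export          -- hypothetical table over the export
        WHERE metric_name = 'StorageBytes'
        ORDER BY value DESC
        LIMIT 20
    """,
    QueryExecutionContext={"Database": "example_database"},  # placeholder
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
```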


But there are still some questions, I think, that we're trying to figure out when it comes to pricing. There are some places where maybe the web console is free, but the CLI isn't.


Jesse: Yeah.


Pete: Is there an API to this? I don't know. We didn't have time to check that out.


Jesse: Yeah. I will say, final thoughts for me: this is definitely awesome. I'm really excited that the free tier has so many amazing features, and I'm really excited to dig into it more. I would say go enable the free tier today. Well, it's already enabled for your individual accounts, but if you want to enable it across multiple accounts, and you've got multiple accounts, by all means do so. Again, it's free, and who doesn't love more data to make more data-driven decisions?


Pete: Let's be honest, AWS Storage Lens is the best new service that was not announced at AWS Storage Day. So, if you've enjoyed this podcast, please go to lastweekinaws.com/review, give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review and give it a five-star rating on your podcast platform of choice and tell us what interesting things you found with AWS Storage Lens.


Announcer: This has been a HumblePod production. Stay humble.

Fri, 04 Dec 2020 03:00:00 -0800
The Most Under-Appreciated AWS Service

Want to give your ears a break and read this as an article? You’re looking for this link.


Sponsors

Never miss an episode



Help the show



What's Corey up to?

Wed, 02 Dec 2020 03:00:00 -0800
Punched in the Faith
AWS Morning Brief for the week of November 30, 2020 with Corey Quinn.
Mon, 30 Nov 2020 03:00:00 -0800
AWS Services for Thanksgiving Dinner

Links

  • Follow Last Week In AWS on Twitter

Transcript
Corey: This episode is sponsored by ExtraHop. ExtraHop provides threat detection and response for the Enterprise (not the starship). On-prem security doesn’t translate well to cloud or multi-cloud environments, and that’s not even counting IoT. ExtraHop automatically discovers everything inside the perimeter, including your cloud workloads and IoT devices, detects these threats up to 35 percent faster, and helps you act immediately. Ask for a free trial of detection and response for AWS today at extrahop.com/trial.


Pete: Hello, and welcome to AWS Morning Brief. I am Pete Cheslock, and I am here yet again with Jesse DeRose. Jesse, welcome back.


Jesse: Thanks for having me, Pete.


Pete: But it's not just the two of us. We have a very special guest: we are also joined with one of the newest hires to The Duckbill Group, Amy Negrette. Amy, hello.


Amy: Hello. And one might say the most special of guests; that person would be me.


Pete: The most special of guests.


Jesse: [laugh].


Pete: Well, we are pleased to have you. So, in honor of Thanksgiving—American Thanksgiving, for anyone outside of the United States, or who doesn't celebrate. But this is the American Thanksgiving holiday week. We wanted to take a little different approach to this week's episode. And Amy, you were the one who kind of came up with this idea, and so that's why we forced you to join us because—


Jesse: One of us. One of us.


Pete: [laugh]. Because you had such a good idea, and we wanted to make sure that we just pulled this together and really did a Thanksgiving theme to this podcast. So, I don't know about either of you, but my family has some very clear requirements about what dishes do and do not constitute Thanksgiving. And you can always expect the turkey and the stuffing. It's just not Thanksgiving without those core components.


Jesse: But then your cousin's boyfriend shows up with the candied vegetables that nobody asked to be candied. And, you know, you put a little bit on your plate because you want to be nice. You don't want to start World War III in the middle of Thanksgiving dinner. And you say, “Oh, yeah, this is good.” But then you're definitely giving those food scraps to the dog under the table and you don't go back for seconds.


Pete: I mean, a metric ton of sugar is probably the only way to make turnips taste good.


Jesse: Yeah.


Pete: So, with that in mind, we wanted to talk about which AWS services are the core services you expect customers to be using to leverage the cloud, and which services would represent a Thanksgiving meal. Which ones constitute the turkey, or the stuffing, or the green bean casserole? Which, while preparing this, there seemed to be some conflicting thoughts about the quality of a green bean casserole.


Jesse: There are some hot takes. Some hot, hot, hot takes in this discussion, putting this list together.


Pete: So, I'll kick us off with an easy, softball one because why not? But it's EC2, right? This is the turkey. It's the main course. And it's also what you'll be eating three to five times a day for every day for the next week or two because you're going to have a lot extra. It's just going to be around for a long time.


Jesse: Yeah, I feel like EC2 is one that you're going to get in some capacity, anywhere. Whether it is straight-up EC2 instances, whether it is Fargate, ECS, you're going to be using this compute resource in some capacity if you're using AWS. I don't think I know of any AWS customer that is not using some level of compute with EC2. Except for the few people who have managed to move entirely serverless to Lambda, which I am thoroughly impressed if you've been able to do that.


Pete: So, that's actually a great segue, Amy: you do a lot with the serverless community. What do you think Lambda would be as a Thanksgiving side dish?


Amy: It is the canned cranberry sauce, because everyone I hear talk about it seems to hate it, but I love it. I love not having to work for anything. It tastes the same every time, the sauce itself tastes like jelly, and Lambda packages everything in a way where I don't have to deal with it; to me, that makes everything else super easy.


Pete: I think it's the way it slowly oozes out of the can that really kind of makes me not want to like it, and those too-perfect ridges from the shape of the can. But I don't know what it is about it; when you just slice through that and put it on your plate, it's so delicious. And don't at me with your fancy homemade cranberry sauce or whatever. None of that can hold a candle to it.


So, I actually think Lambda is the special smoked turkey. Because it's a new trend. Lambda being in the new trend, serverless is a new trend. And of course, everyone who is doing a smoked turkey or has a smoker just can't stop talking about it, much like serverless. They just can't stop talking about it.


Jesse: Yeah. I mean, I think that ever since you bought your smoker, you have not stopped telling us all about the meats that you're smoking on a recurring basis.


Pete: I mean, I got a 16-pound turkey for $14, and I got turkey for days.


Jesse: What I love is that not only do you have a smoker and you talk about it, but you have a monitoring system that you set up so that you can monitor the temperature of the smoker at any given time.


Pete: I'm a bit of a Luddite at home. I don't like IoT-powered anything because I think it's all generally terrible, but for some reason, yeah, my smoker has a little whatever that connects to my wifi, and I can get to it from the app on my cell phone and check the temperature of the turkey while I'm out at the store running errands. "Oh, got to get home soon; my turkey's almost done."


Jesse: Okay, I’ve got another easy one for us. S3 is your mashed potatoes. It's good, it's on everyone's plate, there appears to always be an infinite amount of it. Everybody's going to want some. And most importantly, if you leave a bucket of it open overnight, you're going to regret it.


Pete: Yeah, that's going to turn to glue pretty fast. Not Amazon Glue, though actually, if we are going to talk about Amazon Glue and Lake Formation and that weird amalgamation of Amazon services, we actually have one for that. It's something called the piecaken, which I had never heard about until I saw an Instagram ad, because that's a thing. But a piecaken is a pecan pie… pecan or pe-cahn? Let's not do that.


Jesse: Oh, God, don't start.


Pete: Okay. Pumpkin pie, spice cake, and an apple pie filling. It's like three pies stacked into a cake. And that's what I think of when I think about the whole Lake Formation/Glue setup when you're trying to query or analyze your data lake.


Jesse: Yeah, my arteries just clogged hearing that description of all of those things combined together in one dessert.


Amy: It also takes several fully complete and difficult concepts, and then squishes them into one very complicated package.


Jesse: Yeah. Was that really necessary? Do we really need all of those things combined in one?


Pete: Well, if you cover it with buttercream frosting, anything's good. So, I think that's a lesson for the teams involved with Lake Formation; your next service needs to be something related to buttercream frosting. Amy, what do you think IAM would be because that is pretty ubiquitous in the Amazon world.


Amy: Since you have to put it in everything, it may as well be the gravy.


Jesse: Yeah. Now, with the gravy, are we talking, like, full-blown giblet gravy here?


Amy: I've only ever had giblet gravy. But in researching this, there's apparently more than one kind, and everyone decides to do it differently, which I'm guessing is where all the arguments about how to do it properly come from.


Pete: So, it really is like IAM: all of the different authentication methods and models, whether you're using access keys, or cross-account roles, or some sort of federation. It really is like gravy. Jesse, CloudWatch is pretty heavily used, and we were talking about monitoring the turkey before. Where would CloudWatch fit on the Thanksgiving AWS dinner table?


Jesse: Yeah, I feel like this one's kind of apropos to your smoked turkey. CloudWatch is definitely the deep fryer. Uncle Buck repeatedly says he knows how to use it, but ultimately ends up getting burned every single time. It doesn't matter how many times you claim what you're doing, you're always going to get burnt using this.


Pete: All right. So, I've got a good one, because I am from the Midwest; I'm from Michigan, and I feel like Midwest folks create some pretty horrific Thanksgiving sides. I actually looked up a list of the worst Thanksgiving sides before this—


Jesse: Oh no.


Pete: And pretty much all of them are what I would expect to see at one of my family's Thanksgivings. So, I was a little angered by that, but one thing I did agree with is something called ambrosia salad.


Jesse: Oh, bleh.


Pete: This is essentially a mixture of Cool Whip (which is a fake whipped cream), fruit, marshmallows, and, like, other stuff. And I think that ambrosia salad is pretty much like SimpleDB, because why would you put these things together and offer it to someone else? Just like SimpleDB: why would you take SimpleDB and offer it to someone else? We should retire SimpleDB for the exact same reason we should retire ambrosia salad.


Jesse: I want to say that you're going to find some diehard ambrosia salad fans out there for sure. I want to say you're going to find some diehard SimpleDB fans out there, but I don't think they exist.


Pete: I'm waiting to get that one. I want someone to send me a message, say, “I use SimpleDB. I love SimpleDB.” I don't know, if you need a database, as you know, you should use Route 53 instead.


Jesse: Absolutely. This is probably another easy one. CloudTrail. CloudTrail is the pie. It's always ready 15 minutes after everything else is done.


Pete: Yeah. It does take some time for things that happen to actually show up in CloudTrail. We could probably even make another case that if you eat too much pie, you're going to feel pretty terrible. And similarly, if you create too many CloudTrail trails, you'll end up with a line item on your bill that is definitely not zero, and you're going to wonder why.


Jesse: Absolutely.


Pete: So, Amy, I think you had a pretty good one for one of the most—I don't know if it's the most obscure, but it's definitely an obscure Amazon service, the Quantum Ledger Database.


Amy: And I found this out trying to research just the overview of Amazon services: this is essentially a blockchain product, because they list it as a blockchain product, but I have heard that it is both not a blockchain product and useful for things that Amazon can't talk about. And my mother tries to convince me that sweet potatoes and squash are the same thing. In just the same way that QLDB and blockchain aren't the same thing, a squash and a sweet potato are not the same thing. And just because they're the same color, you cannot sneak one into my food.


Jesse: Absolutely not. I'm a big sweet potato fan and a big squash fan. But I can say those are two very distinctly different things.


Pete: I don't know. I don't eat any kind of vegetable at Thanksgiving that isn't with a marshmallow on top, I guess. We already talked about the ambrosia salad issue with my family. [laugh].


Corey: This episode is sponsored in part by ChaosSearch. Now their name isn’t in all caps, so they’re definitely worth talking to. What is ChaosSearch? A scalable log analysis service that lets you add new workloads in minutes, not days or weeks. Click. Boom. Done. ChaosSearch is for you if you’re trying to get a handle on processing multiple terabytes, or more, of log and event data per day, at a disruptive price. One more thing, for those of you that have been down this path of disappointment before, ChaosSearch is a fully managed solution that isn’t playing marketing games when they say “fully managed.” The data lives within your S3 buckets, and that’s really all you have to care about. No managing of servers, but also no data movement. Check them out at chaossearch.io and tell them Corey sent you. Watch for the wince when you say my name. That’s chaossearch.io.


Jesse: Speaking of squash, to me VPC endpoints are like a roasted squash. You know that there's multiple varieties, and you've definitely already had this conversation before about which one’s which and which one you want, but you can never remember. You can never remember which one is which and which one you want after you've already bought it.


Pete: Yeah, there's too many squash names out there.


Jesse: Yeah.


Pete: I only know there's the acorn one because it looks like what it is; like, it looks like an acorn. And then spaghetti squash… maybe butternut? I don't know. But then you've got a whole family of squashes out there. I think one of my favorite services in Amazon, because every once in a while we still see it in use, is a subset of EC2, so you'll have to stretch your imagination a little bit because we did say EC2 was the turkey.


But specifically, there is a service within EC2 (and this has no relation to the turkey, which is why you really need to suspend your disbelief here): the m1.small. That is the first instance type on EC2. That was from the EC2 beta; the m1.small was the first instance you could set up.


Admittedly, that instance type, which you can still deploy to now, came out in, what, 2006? That is an old server that you are running on if you are deployed to an m1.small. There's no earthly reason that you should be on an m1.small, for the exact same reason that you should never, ever make a dish called apples and onions.


And you can't really believe that that's a thing, just like you can't believe an m1.small is still a thing. And, just like with apples and onions, you're pretty sure someone is trolling you when they say that that's an actual dish. I had to go and look it up. Amy, I think you were the one who told me about this dish, apples and onions. I couldn't believe it. And I even said this dish must taste good if you, like, maybe caramelize it down a lot. I think, Amy, you said—


Amy: I was the one who told you that would be a way to do it, but that is not how it's made. It is made by pan-frying onions and apples together until both of them taste like onions.


Pete: Gross. So, I was horrified by this dish that I had just learned about. And you should not make that dish just like you should not use an m1.small.


Jesse: Please, please, please, please. Yeah, if apples and onions shows up at my Thanksgiving dinner, that is not going on my plate. But one thing that I can say will go on my plate will be dinner rolls. I love bread. I am kind of addicted to the bread week episodes of every season of The Great British Baking Show.


So, in honor of that, I would say any AWS Managed Services would be your dinner rolls. I'm personally fine with your store-bought rolls from the bakery or from your local grocer, but my sister refuses to touch any rolls unless she makes them herself.


Pete: Out in New England, we have Parker House Rolls that is the classic Thanksgiving—well, for any meal, really, they're just delicious, but—of course, they're delicious. It's just bread covered with butter and salt.


Jesse: Yeah, like, is it really necessary for you to make them yourself? Like I understand in the middle of a pandemic now, everybody's learned how to make their own bread, so maybe you think that you've got a leg up on the competition to make your own rolls this year, but you're already cooking a turkey, you're making sides, you're making desserts, do you really need to make your own rolls, or can you get away with the store-bought variety?


Pete: Ain't nobody got time for that.


Amy: If there's one food that you can really get away with just not making yourself, it's going to be the bread roll, because you're going to be dipping it in other stuff anyway. And just like managed services: if someone else can take one thing off your plate, why wouldn't you let them?


Pete: Absolutely.


Amy: It’s probably cheaper than all of that.


Pete: Absolutely. Hey, Amy, what do you think EFS would end up being on the Thanksgiving table?


Amy: I love the idea of green beans, and I love the idea of casseroles, but the one dish where they put the two together always feels like a bad idea. Like, why couldn't they be separate? They don't need to be this one thing. And I always feel that same apprehension; when AWS announced EFS for Lambda, that felt weird, and that felt wrong.


Jesse: Yeah. Like I feel the same way about EFS and green bean casserole that I do about QuickSight. I really want to like it. I really do. Like, I want to give it a chance, but it just never turns out the way I want it to.


Pete: I think you need more of those crunchy onions to put on top, and not the canned cream of mushroom soup that normally goes with the green bean casserole.


Jesse: Yeah, those crunchy onion… straws, I think they're called. Those are heaven.


Pete: I mean, I'll just eat those. I don't actually need the green beans or the soup. I think one of the ones that is maybe the most painful for a lot of people are the NAT Gateways. It seems… you know, that seems so easy, right? It’s NAT Gateway, you just spin it up and it just takes care of things.


But I think it's like starting an argument with your racist uncle. Like, it might seem like a good idea; you might feel like the better person, finally calling them out for their behavior so that they will hopefully stop doing it, or just stop coming to Thanksgiving, which would be equally good. It seems like a good idea at the time, but you're going to end up spending the rest of your evening paying for it. And that's really the case with NAT Gateway: you're going to spend the rest of your time paying for that sucker.


Jesse: Yeah, absolutely. I always feel like I want to do this, but it's safer just to stay away.


Pete: So, Amy, Service Catalog; what do you think this would be for our Thanksgiving meal?


Amy: I brought this up because it is not unheard of that I might have started a kitchen fire or two.


Pete: [laugh].


Amy: So, by the time you have burned everything that you're supposed to serve, maybe it's just time to order takeout. Maybe you should use the work someone else did, which is what Service Catalog is great for. You use solutions that are already approved by your company; you can just spin them up in your account and not have to deal with the impending figurative fires, or, in my case, maybe literal little fires.


Pete: Oh, awesome. I have a recollection from my childhood of my sister setting our microwave on fire trying to microwave some chocolate sauce. I don't know why it's still burned into my memory, but that fire was really… I was just happy it wasn't me.


Jesse: I've got one more. I would say that Beanstalk is going to be the cornbread, similar to the bread theme that I was talking about earlier. I always have high hopes for this, but 99 percent of the time it just lets me down.


Pete: I think I'm going to end this with a fitting last one, because we have all these delicious foods, but we need somewhere to put them, and that's on our fancy dinner china that you only use this one day of the year. And we think the fancy dinner china is the closest thing to the io1 EBS volumes. Or maybe it's io2 now, or even the faster ones? But this is essentially it: you have this fancy dinner china, or io1 and io2 volumes, and you have no idea why someone chose the most expensive delivery method possible for your meal, and you're pretty sure that you're going to break something before the night is over. That is what I think of whenever someone mentions needing an io1 or io2 volume for their application.


All right, well, that was a lot of dishes. I'm thoroughly hungry, which I really shouldn't be since Thanksgiving is over and I've eaten my fill.


Jesse: I want to say that I've eaten my fill. Well, let's be honest, I've eaten my fill and then some, at this point.


Pete: I still really want that piecaken.


Jesse: Oh, God. That one's all you. I will send you a piecaken for the holidays.


Pete: [laugh]. Please don’t. The last thing I need is a giant cake in my house because I would eat it all.


Jesse: Not only would you eat it, but then all of your kids would eat it as well, and I can only imagine their existing energy level plus that much sugar, running through your house.


Pete: It's not a good look at all. Well, Amy, thank you for joining us. Thank you for this great idea. I had a blast. I hope everyone else did as well, listening in.


If you enjoyed this podcast, please go to lastweekinaws.com/review and give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review, give it a five-star rating on your podcast platform of choice, and tell us what your favorite Thanksgiving side dish is.


Announcer: This has been a HumblePod production. Stay humble.

Fri, 27 Nov 2020 03:00:00 -0800
Secrets of AWS Contract Negotiation

Want to give your ears a break and read this as an article? You’re looking for this link.

Sponsors

Never miss an episode



Help the show



What's Corey up to?

Wed, 25 Nov 2020 03:00:00 -0800
GitHub's Basement
AWS Morning Brief for the week of November 23, 2020 with Corey Quinn.
Mon, 23 Nov 2020 03:00:00 -0800
AWS Storage Day 2020 Part 2

Links

  • Follow Last Week In AWS on Twitter

Transcript
Corey: Gravitational is now Teleport because when way more people have heard of your product than your company, maybe that’s a sign it’s a time to change your branding. Teleport enables engineers to quickly access any computing resource, anywhere on the planet. You know, like VPNs were supposed to do before we all started working from home, and the VPNs melted like glaciers. Teleport provides a unified access plane for developers and security professionals seeking to simplify secure access to servers, applications, and data across all of your environments without the bottleneck and management overhead of traditional VPNs. This feels to me like it’s a lot like the early days of HashiCorp’s Terraform. My gut tells me this is the sort of thing that’s going to transform how people access their cloud services and environments. To learn more, visit goteleport.com.


Pete: Hello, and welcome to AWS Morning Brief. I am Pete Cheslock, and I'm also here, again, with Jesse DeRose. Hey, Jesse, how's it going?


Jesse: Not too bad. Thanks for having me.


Pete: It is part two of AWS Storage Day. If you haven't had the chance to listen to last week's episode, Jesse and I dove into some of the new features, really focusing on what we think is the biggest feature of AWS Storage Day, which was S3 Intelligent-Tiering. Go back and listen if you didn't hear about it. But essentially, Amazon keeps extending features [00:01:34 unintelligible] this Intelligent-Tiering platform. And we talked a little bit about it last week.


But there were a lot of announcements as part of Storage Day, some pretty impressive, and some maybe a little underwhelming. We'll let you be the judge of that, because some of these things could be incredibly important for you as, maybe, someone who operates on Amazon. So, what we're going to do now is dive into some of the other features: not only additional interesting S3 features, but also a lot of new features announced around EBS, and EFS, and FSx, and all the different ways that you can interact with AWS storage. I don't want to call any one of these the biggest feature of this section because I think, let's be honest, they're all equally meh features, right, Jesse?


Jesse: Yeah.


Pete: I think that's going to be the common thread. Again, you might look at some of these features and go, "Finally, my life is so much better because they've announced this." But I've got to say, outside of Intelligent-Tiering, Storage Day felt a little weak. Let's dive in anyway. S3 Replication: if you are replicating your data from one S3 bucket to another bucket in another region, which maybe you need to do for compliance reasons or disaster recovery reasons, some of the new features they added are around replication metrics and notifications.


Now, previously, these metrics and notifications were only available if you used Replication Time Control, which is an additional charge to get a predictable SLA for your data replication. They've made these metrics available to everyone now, so it's actually awesome to hear that they've extended that out and are kind of giving you something for free. Additionally, they now replicate delete markers, and I swear I looked at a bunch of documents to better understand what delete markers mean here. The best I got to is that I don't really understand the problem from before, other than that, as you delete a version of something in the source, the delete marker now moves over, while the previous versions remain in the destination. That was my gist of it. Jesse, what was your gist of that one?


Jesse: Yeah, I struggled a little bit with some of this previously, because S3 Replication always felt like this magical, hand-wavy feature where you turned it on and then just waited, and eventually your objects would show up in your destination bucket or destination folder. But there wasn't really any clear view of what was going on behind the scenes. So, I'm really excited to see that these metrics and notifications are now available to everyone, not just folks using the Replication Time Control feature, letting everybody more easily understand how their data is replicating between S3 buckets behind the scenes. I feel good about this one. I feel like this is definitely a step in the right direction, and I'm really excited to see that it's now broadly available for everybody using S3. I think it will make S3 Replication easier to use for a lot of folks who need it for business purposes or any other use case.
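
For the curious, a replication rule with delete marker replication and the now-free metrics turned on looks roughly like this. A hedged boto3 sketch with placeholder bucket ARNs and IAM role; double-check the current PutBucketReplication schema before relying on it.

```python
# Hedged sketch: whole-bucket replication with delete markers and
# replication metrics enabled. All identifiers are placeholders.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_replication(
    Bucket="example-source-bucket",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::111111111111:role/example-replication-role",
        "Rules": [
            {
                "ID": "replicate-everything",
                "Status": "Enabled",
                "Priority": 1,
                "Filter": {},  # empty filter: the whole bucket
                "DeleteMarkerReplication": {"Status": "Enabled"},
                "Destination": {
                    "Bucket": "arn:aws:s3:::example-destination-bucket",
                    "Metrics": {"Status": "Enabled"},
                },
            }
        ],
    },
)
```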


Pete: Yeah, absolutely. Another really awesome feature, and I was actually excited for this one because, of course, it must affect me in my day-to-day: S3 Object Ownership is now available in all of the Amazon regions and, amazingly, supported by CloudFormation, which I feel like is always an afterthought. What this allows you to do is ensure that when files are uploaded, ownership is assumed by the bucket you've uploaded them into. And so this gets around a lot of hairy issues that come up in S3 permissioning and IAM permissioning. I mean, S3 permissioning, in general, predates IAM. I don't know how many people actually know that. And I think because of it, there are some really gnarly edge cases people run into, and this is a big problem solver.


Jesse: I am really, really excited about this feature release, I cannot say how many times we've run into this edge case with some of our internal tooling because we have effectively copied or synced data from a client's S3 bucket into our S3 bucket, and we don't gain ownership. And that becomes such a permissioning headache to be able to do anything with that data once we have it in our S3 bucket. So, I'm really excited to see that object ownership is now not only a first-class citizen but now is also built into and supported by AWS CloudFormation.
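

As a rough sketch of the API behind this (the bucket name is a placeholder), the setting is a single call with boto3; the equivalent property also now exists on CloudFormation's AWS::S3::Bucket resource:

```python
import boto3

s3 = boto3.client("s3")

# Sketch only: the bucket name is a placeholder. With BucketOwnerPreferred,
# new objects uploaded by other accounts with the bucket-owner-full-control
# ACL become owned by the bucket owner's account.
s3.put_bucket_ownership_controls(
    Bucket="example-bucket",
    OwnershipControls={
        "Rules": [{"ObjectOwnership": "BucketOwnerPreferred"}],
    },
)
```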


Pete: Yeah, absolutely. Another new feature has to do with Outposts: you can get S3 on Outposts now, which is truly amazing if you think about it. Now, I don't know of anyone who is actually using Outposts, and I would love to chat with someone who is—if they're even allowed to, or if they're stuck under an NDA. But what an Outpost allows you to do is essentially purchase a rack of AWS; it's a rack of servers and storage with Amazon APIs. If you really just think about that for a second, that's pretty impressive.


And if you are going to do hybrid cloud, and you have maybe some data locality requirements like you really need data in a specific location and that's not a region that Amazon supports, or you have data centers, or there's always some requirements, you can now get S3 on there. And they said that they can support 48 or 96 terabytes of S3 capacity per Outpost. What that actually means—like, is that a rack? Is that a whole rack? Is that just a single S3 configuration? Hard to really know. There's no API to go and provision an Outpost yet.


Jesse: Yeah, I'm really curious about this one to see how folks end up using it because I'm super excited that this is a feature that's now available. I love the idea of Outposts, even though it may not be a business use case for us internally. But I'm really curious to see how this changes the game in terms of object availability closer to the edge, closer to different locations for not just availability, but also for legal requirements for data storage around where you can or need to store data for compliance purposes.


Pete: Look, I'll be honest: I know we will have made it as a business when we get an Outpost shipped to Corey’s house so that we can put The Duckbill Group's static website on an S3 bucket in Corey’s house. That's just how you know you've made it.


Jesse: But honestly, I have to say that I still prefer a Duckbill website status page that is manually updated by our intern, Fred, on an hourly basis. And so I don't know if we'll ever be able to move away from that model.


Pete: It's true. It is serverless, so we do like to be really progressive in our usage of serverless there. But I think the gist of what Storage Day really emphasized when it came to S3 is: use the right storage class for your workload.


Jesse: Absolutely.


Pete: Amazon gives you so many different types of storage class tiers that it's almost criminal to just use S3 Standard for everything. We see it, right, Jesse? We see this all the time.


Jesse: Yeah, all the time. So many folks put their objects in S3, call it a day, and walk away. But there's so much functionality available beneath the surface in the different S3 tiers that can be leveraged. We highly, highly, highly recommend finding the right tier for you, and leveraging those tiers to optimize the amount of money you're saving on object storage in S3.


And to be clear, we understand that you may not be able to spend tons of time looking at the access patterns for your object data, and you may not want to spend the engineering overhead to move data into these different tiers. Ultimately, then, you can turn on S3 Intelligent-Tiering, which will automatically analyze those patterns for you and move objects into the correct tiers accordingly. Or you could turn on S3 Analytics, which will also do that analysis for you and then make recommendations that you can choose to implement to move your S3 data into different tiers.
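

As a point of reference, "turning on" Intelligent-Tiering usually means either uploading objects with that storage class or adding a lifecycle rule. A minimal sketch of the latter with boto3, where the bucket, rule ID, and prefix are made up:

```python
import boto3

s3 = boto3.client("s3")

# Sketch only: bucket, rule ID, and prefix are placeholders.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-logs-intelligently",
                "Status": "Enabled",
                "Filter": {"Prefix": "logs/"},
                # Move objects into Intelligent-Tiering as soon as they land.
                "Transitions": [
                    {"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}
                ],
            }
        ]
    },
)
```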


Pete: I think it's important to call out, too, that it's a little surprising—but also, I guess, I'm not surprised, and that's a weird statement to make—but when you use Amazon storage, you just push data and you forget about it. You don't have to think about it. You don't need an administrator. I mean, Jesse, you and I are both former sysadmins; we've managed NAS boxes and SANs before. Can you imagine a world—I mean, you don't have to, because we're living in it—where you could store petabytes of data without an administrator of those systems to administer that infrastructure?


And I think that's kind of what we're seeing: that lack of ownership, where no one really owns the S3 storage because you don't need an administrator anymore. But because of that, people end up just saying, “Well, I’ll use Standard and call it a day.” And you don't really notice it until it becomes one of your top three line items and you're like, “Whoa, how did our S3 storage approach six and seven figures? That seems like something we should look into.”


Jesse: Yeah, to me, the metaphor is a DBA: somebody who will effectively optimize your database usage and your database storage in such a way that you optimize your spend. A lot of companies don't spend money on that role because they put a lot of things into database storage and then forget about it, in the same way they push their object data into S3 and forget about it. I'm not recommending that you hire somebody specifically to do analysis of your S3 object data, but I think it is something worth investigating. Even if all you do is turn on S3 Intelligent-Tiering, or turn on S3 Analytics and then implement some of the lifecycle policies and recommendations that feature makes, it's still worth your time because you will end up saving money optimizing that spend.


Pete: Yeah. And that's why we keep going back to Intelligent-Tiering: it's the best way to save money passively—you don't have to think about it. As your data gets older and goes unused, it automatically costs less. And that's what's so compelling about this storage tier: if you don't have the time, just leverage Intelligent-Tiering. Again, there are some caveats—we called those out in the last episode, and you definitely should think about them—but when it comes to saving money, if you can just sit back and let it happen, there's not much better than that.


Jesse: Absolutely.


Corey: This episode is sponsored in part by ChaosSearch. Now their name isn’t in all caps, so they’re definitely worth talking to. What is ChaosSearch? A scalable log analysis service that lets you add new workloads in minutes, not days or weeks. Click. Boom. Done. ChaosSearch is for you if you’re trying to get a handle on processing multiple terabytes, or more, of log and event data per day, at a disruptive price. One more thing, for those of you that have been down this path of disappointment before, ChaosSearch is a fully managed solution that isn’t playing marketing games when they say “fully managed.” The data lives within your S3 buckets, and that’s really all you have to care about. No managing of servers, but also no data movement. Check them out at chaossearch.io and tell them Corey sent you. Watch for the wince when you say my name. That’s chaossearch.io.


Pete: So, that pretty much is it for the S3 changes and the new features there. Of course, re:Invent is coming up soon, so we'll have to hear some of the other cool stuff that's coming out. But there were some EBS announcements, although only really one main one that I could see, and that is a cost savings. So, there's sc1—the Cold HDD volume type—which is really low-cost magnetic storage that you can use for sequential workloads: Hadoop clusters, log processing, those large-scale but sequential, non-random I/O types of usage. And they dropped the price of those pretty dramatically for Amazon: 40 percent—four-zero percent. I'm trying to remember when we last saw some really big cost-cutting measures by Amazon. In the early days, it was like every year there was another price cut, but I don't feel like we see this as much anymore.
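

For context, sc1 is just another volume type at provisioning time. A quick sketch with boto3, where the Availability Zone and size are placeholders:

```python
import boto3

ec2 = boto3.client("ec2")

# Sketch only: AZ and size are placeholders. sc1 is the Cold HDD type,
# aimed at large, sequential, throughput-oriented workloads.
volume = ec2.create_volume(
    AvailabilityZone="us-east-1a",
    Size=500,  # GiB
    VolumeType="sc1",
)
print(volume["VolumeId"])
```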


Jesse: Now, I think this is something that does not happen nearly as often as we would like it to. There are definitely price reductions over time as older hardware is phased out and newer hardware is phased in, but I can't remember the last time we saw such a dramatic price cut—and not just a dramatic price cut, but one across all regions where EBS is available.


Pete: Yeah. And so it begs the question: why was there such a big price cut? Was there low adoption? Did people think it was too expensive? Did some large customer use a ton of these and then turn them off one day, and now you're getting some EBS on the cheap? We don't know.


Jesse: We don't know.


Pete: But I do remember—this is probably one of my favorite EBS stories—in the very early days of EBS, around 2010 or so, working at a company where we had about 350 terabytes of unattached EBS storage, which, I have heard very informally, was a multiple percent of global EBS capacity at the time. It was sitting, unattached, in an account because we didn't clean up after our testing; we would provision 16 one-terabyte volumes to test out our striping setup. And then one day we went and deleted them. It's been a decade now, and I still think about that, because either the EBS team was supremely happy that I just gave them back a ton of capacity, or really sad that I just gave them back a ton of capacity.


Jesse: [laugh].


Pete: I don't recall any big price cuts afterwards, so I don't think it was too bad. Next on the list were some EFS announcements. And I've got to say, they talked through the history of EFS announcements over the year, and in many ways, I think that's what we noticed about Storage Day: it wasn't necessarily “here are all these great announcements we held onto for the day,” but really, “let's summarize all of the hard work that was put in over the last year.”


Jesse: Absolutely.


Pete: And so that's why, in a lot of ways, I think Jesse and I both thought this was very underwhelming. But it's because our noses are so close to Amazon: we look at the blog constantly, and we obviously follow Corey’s newsletter. It is a requirement at Duckbill: we all have to sit around together and read it out loud. We take turns; it's fun—


Jesse: It’s great.


Pete: And because of that, we see these features come out all the time. And we know that we are an outlier; not everyone has that ability. And so that is where they spent a lot of time talking about these new features. With EFS, they talked about features they'd added earlier in the year: support for Fargate, for ECS, for Lambda, for EKS—Fargate on EKS. But they did just add one additional feature that kind of feels like maybe it wasn't a feature. I don't know, Jesse. What was this feature? And how amazing was it?


Jesse: Yeah, drumroll please: you now can directly create and attach EFS to EC2 instances at launch, through the EC2 Console Wizard.


Pete: So, basically, the EC2 console team discovered a brand new service.


Jesse: And I think the big story here is that, all of a sudden, this new feature is available. But let's talk about that, because it's not really moving any major needles. Is it really the super-innovative thing we're used to from AWS? No. I'm thankful that this is now available, and I'm thankful that this is a feature we can leverage starting today. But, similar to what you just said, Pete, it just feels like the logical next thing to do, and I don't understand why it's getting its own announcement at Storage Day.


Pete: Yeah, exactly. Looking at how this was before: you'd have to go into EFS, create a file system, and then go into EC2 and attach it, maybe after the fact, or maybe there was even an extra step in there. So, look, kudos. They’re removing a step or two from the process, and anytime you can do that, that's great. Of course, my sysadmin, automate-everything, curmudgeon self says: why are you in the console anyway?


Jesse: Oh, yeah.


Pete: That part of me just went, “Come on. We've automated this. Hopefully, this shouldn't be an issue.” But the other part that, I guess, annoys me a little bit about this is: have you spun up an EC2 server via the console recently? I call it the Christmas tree, the Christmas tree application.


And what happens is, the Amazon engineering team keeps putting ornaments on it; eventually, that tree is going to fall over. I don't know when, but the number of settings and tabs that you have to go through to get an instance—this is why services like DigitalOcean exist and are doing so well. Just give me a server and get out of my way. The number of things you might have to answer when provisioning EC2 has got to be measured in the hundreds at this point. Now add one more to it, because you can turn on EFS as well.
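

In the spirit of that automate-everything curmudgeon, here's a rough sketch of skipping the console wizard entirely: create the file system and a mount target with boto3. The token, subnet ID, and security group ID are all hypothetical:

```python
import boto3

efs = boto3.client("efs")

# Sketch only: token, subnet ID, and security group ID are placeholders.
fs = efs.create_file_system(
    CreationToken="my-app-shared-storage",
    PerformanceMode="generalPurpose",
    Encrypted=True,
    Tags=[{"Key": "Name", "Value": "my-app-shared-storage"}],
)

# In real code, wait for the file system to reach the 'available'
# lifecycle state before creating the mount target.
efs.create_mount_target(
    FileSystemId=fs["FileSystemId"],
    SubnetId="subnet-0123456789abcdef0",
    SecurityGroups=["sg-0123456789abcdef0"],
)

# On the instance itself, the mount is then something like:
#   sudo mount -t efs <FileSystemId>:/ /mnt/efs
```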


So, there was a lot of talk as well about FSx—that's the fully managed Windows file server. For a lot of enterprises, the user quotas and bandwidth quotas they announced are probably a big feature, but I think the biggest thing we were really seeing was further integration of AWS Backup with the broader storage services—just adding more support for that. Which makes a ton of sense: if Amazon is providing you with more storage services, and you're any sort of business that has to back up that data, having an integrated way of doing it makes a lot of sense. And as we know, Amazon builds for the customers, right? They build what the customers ask for.


Jesse: Yeah, I'm really excited to see these new features released. It definitely feels like a step in the right direction, and it definitely feels like the correct way to help customers manage their backups across various different AWS services. I'm looking forward to seeing more usage of—and more features in—AWS Backup in the future.


Pete: Absolutely. That pretty much does it. There's a whole slew of different services, and you can actually go to the Amazon Twitch site to watch these videos—it's kind of nice background; you know, listen in. There were some questions asked along the way. All in all, I think it was an interesting presentation, and it feels like maybe they're holding on to the really good stuff for re:Invent.


Jesse: Yeah.


Pete: Only time will tell. I think we'll look forward to what they announce later. But really, some interesting features. But I think at the end of the day, personally, a lot of this stuff just felt like a summarization of the year and not really brand new announcements. What were your thoughts, Jesse?


Jesse: I struggled with the same thing. A lot of these felt like logical next steps for releases, especially in terms of the new S3 tiers and the S3 metrics being available. But overall, I felt like so much of the information that was shared was just data without a story; without use cases. It was very difficult for me to understand, why is this an important thing that I should be celebrating alongside you? Why is this a feature that all of our customers are going to sing your praises for?


Pete: Yeah, I think at the end of the day, the real winner of this one is the AWS marketing team because they have this whole day of stuff that got us to watch it, and also to talk about it after the fact in multiple podcast formats. So, kudos to the Amazon marketing team.


Jesse: I will say the one thing that I did appreciate is, AWS did comment several times about their goal to focus on helping customers more during the pandemic. They did recognize that customers using AWS today need to be able to do more with less, or do more with the same amount of resources or the same amount of spend that they have now, given current economic times and given current restrictions in this pandemic. So, I really do appreciate and want to send kudos to AWS for acknowledging their customers' pain points there and giving more to customers as a result. I just wish that it was a cleaner overall thread and a cleaner overall story throughout the entire day of announcements.


Pete: Yeah, I think that's really what I missed on it as well is, I like the storytelling; I like to better understand the problems that people are facing, and then how this new feature is going to solve these problems. And that's kind of what it lacked; it kind of lacked the story behind it. So, we'll see what happens at re:Invent, and mostly just wait to hear what other awesome features that they've got in the bag that they're waiting to show us.


Well, if you've enjoyed this podcast, please go to lastweekinaws.com/review and give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review, give it a five-star rating on your podcast platform of choice, and tell us what is your favorite AWS storage service. Thanks again.


Announcer: This has been a HumblePod production. Stay humble.

Fri, 20 Nov 2020 03:00:00 -0800
What I Don't Get About the AWS Gateway Load Balancer

Want to give your ears a break and read this as an article? You’re looking for this link.

Sponsors

Never miss an episode



Help the show



What's Corey up to?

Wed, 18 Nov 2020 03:00:00 -0800
The Place to be for the Important Deets with Brooke Mitchell
AWS Morning Brief for the week of November 16, 2020 with Brooke Mitchell.
Mon, 16 Nov 2020 03:00:00 -0800
AWS Storage Day 2020

Links

  • Follow Last Week In AWS on Twitter

Transcript

Corey: This episode is sponsored in part by Catchpoint. Look, 80 percent of performance and availability issues don’t occur within your application code in your data center itself. It occurs well outside those boundaries, so it’s difficult to understand what’s actually happening. What Catchpoint does is makes it easier for enterprises to detect, identify, and of course, validate how reachable their application is, and of course, how happy their users are. It helps you get visibility into reachability, availability, performance, reliability, and of course, absorbency, because we’ll throw that one in, too. And it’s used by a bunch of interesting companies you may have heard of, like, you know, Google, Verizon, Oracle—but don’t hold that against them—and many more. To learn more, visit www.catchpoint.com, and tell them Corey sent you; wait for the wince.


Pete: Hello, and welcome to AWS Morning Brief. I am Pete Cheslock. Corey, while back from his paternity leave, is still not here. We are having too much fun. And by we, I mean I'm joined again by Jesse DeRose. Hey, Jesse.


Jesse: Thanks as always for having me, Pete.


Pete: It's so much fun to again chat with people outside of my little family unit, that we've just decided not to give this back to Corey. And luckily, Corey has many other podcasts that he does, he was pretty happy to give it away.


Jesse: I feel like you should never talk about your children that way, but he's got a plethora at this point. So, he's willing to kind of share the wealth.


Pete: Exactly. And if you noticed, we have a new theme song—I think last week was the first week we brought it in—which is, I think, much in line with a previous episode where we talked about ’80s breakdancing movies; the new theme song kind of has that vibe to it.


Jesse: I hope you're wearing the Members Only jean jacket that I sent you, along with the shades to match the uniform.


Pete: Yeah. I mean, I was born in ’80, so the ’80s for me, I was very young. I'm kind of waiting for the ’90s movies to come around again because I want to rock out my JNCO jeans and my wallet chain.


Jesse: [laugh], yes.


Pete: And all that good stuff.


Jesse: I am ready.


Pete: Exactly. Well, what are we talking about today? Well, earlier this week, AWS Storage Day 2020 happened on Tuesday. If you were a part of that, it was a free online event. As Amazon called it, a full day online event. Except it was only about four hours long, so kind of mailing it in on that one, huh?


Jesse: Can we start discussing that with our boss and say that a full day of work is technically just four hours? Can we just start working with that going forward?


Pete: Yeah, we'll just say it right now. So, hey, Corey, we're done for the day. Put in the old college four.


Jesse: [laugh]. That's what you say, “I put in the old college try. I just did my full day of four hours, according to AWS. So, this has been great. I'll talk to you tomorrow.”


Pete: Exactly. Well, Storage Day this year—it's the second year in a row if I'm remembering it correctly. 2019 was the last year they did that—and I feel like this kind of ties into the fact that there's just so many announcements that happened around re:Invent, that leading up into re:Invent, you have a lot of announcements to maybe soften the blow for a lot of folks. And Storage Day, really is just this whole day—well, four hours worth of a whole day—talking about everything related to storage. And we're talking about things like S3, EBS, EFS, FSx, for the five huge enterprises that probably use FSx.


Although if you actually do use FSx, I'd be curious to hear about how you like it and what you think of it because we don't really hear a lot of people using it. But these are all the services, plus many more, that Amazon talked about as part of its Storage Day.


Jesse: Yeah, it was a really interesting discussion. I greatly appreciate that AWS broke out this discussion prior to AWS re:Invent, but they dropped a lot of knowledge on us all at once, in rapid-fire succession. I was really… not necessarily surprised, but there was a lot of information that they shared all at once. And I have to admit that after sitting through this presentation, I now have a greater appreciation for Apple's slow presentation style. As much as I hate it—as much as I hate sitting for an hour and a half for one announcement while they toot their own horn—I have to say that the buildup, getting me involved in the story and bringing me along with them, works. It absolutely works. And it was kind of hard for me to pick up on all the things that went on during AWS Storage Day this year because there were a lot of things going on.


Pete: And honestly, the fact that they give so much information is really amazing—their ability to tout, in many cases, minor feature changes that most SaaS businesses would just turn on and maybe blog about. Obviously, the engine of AWS is so good at discussing their wins. But you're right, it's just a huge amount. On Monday, Jeff Barr, of course, wrote the blog post with a lot of these details, linking to countless other blog posts. And I think it really speaks to how probably every, or nearly every, Amazon service ties into storage in some way. It's a huge, huge part of this ecosystem.


Jesse: Absolutely.


Pete: So, as you can imagine, there were so many new features that we're not even going to be able to cover them all over the course of this episode, but we did want to call out some of the big ones—or at least what we thought were the biggest, most interesting new features and product announcements—and also just touch on some of the other things we thought were pretty interesting as well. And yeah, there was a lot of fun stuff. I think the biggest announcement was that S3 Intelligent-Tiering, which is a storage class within S3, now adds additional levels of archive access. With Intelligent-Tiering, you have the automatic tiering of data from frequently accessed to infrequently accessed as things age out; they essentially automate that for you. So, as things are not accessed, you just automatically start paying less for them. And anything automatic in a cost-savings world is going to help you save money.


If you don't have to think about it and it just does it for you, it's fantastic. Well, Intelligent-Tiering added these additional tiers—which are Glacier-level tiers. They are additional places your data can eventually move to as it ages out, based on a whole series of criteria. But there are caveats. There are more caveats now.


One of the interesting things that we actually learned as part of this—because it was buried in a pricing page footnote—is that when you store something in Intelligent-Tiering, there is a minimum storage time period that you will get charged for. It's one month; it’s 30 days. So, you don't even want to use Intelligent-Tiering if the life of your data is less than 30 days. Kind of makes sense if you think about it, because that's the point: you're not sure how long something might live for, so put it there; Intelligent-Tiering will kind of take care of it.


But these new archive tiers add another piece of complexity, and that has to do with the speed at which you can recover that data, because these additional tiers—the Archive Access and Deep Archive Access tiers—within Intelligent-Tiering are Glacier-class storage, which means you will wait the same amount of time as you might wait for a Glacier response. So, can your application support waiting—what is it, Jesse—six hours? 12 hours for a response?


Jesse: Minimum, yeah.


Pete: It's something you have to keep in mind: your app has to support that wait when you request that object. And you can expedite it; there's a charge for it—there's a charge for everything on Amazon, of course—but you have to really plan a little bit more. It's not as plug-and-play. It's not as flip-a-switch-and-magic-happens as it maybe felt like when we originally looked at Intelligent-Tiering.
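

For the curious, that retrieval goes through the standard restore call. A sketch follows, with a made-up bucket and key; note that for objects in the actual Glacier storage class you would also pass a Days value, and expedited retrieval is not available from the Deep Archive tiers:

```python
import boto3

s3 = boto3.client("s3")

# Sketch only: bucket and key are placeholders. This kicks off an
# asynchronous restore; the object isn't readable until it completes.
s3.restore_object(
    Bucket="example-bucket",
    Key="reports/2019/q4.parquet",
    RestoreRequest={
        # "Expedited" is the faster, paid tier; "Standard" is the default.
        "GlacierJobParameters": {"Tier": "Expedited"},
    },
)
```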


Jesse: I will say that this feels like a logical next step in terms of adding additional storage tiers to Intelligent-Tiering. When you look at the storage tiers for S3 in general, you have a number of options, including the Glacier archive options. So, adding similar functionality to Intelligent-Tiering feels right; it feels like a logical next step. But there really are enough caveats that, from a business perspective, we don't recommend just turning it on.


You really need to think about: what is the access pattern of my data? Or alternatively, if you don't know the access pattern of your data, at least understand going in, using Intelligent-Tiering, that there are caveats—there are additional charges if your data is stored for less than 30 days, for example, or if your data is stored for more than 30 days, ends up in one of the Archive or Infrequent Access tiers, and then needs to be restored; there are retrieval times associated with that. So, there are a lot of really great features here for companies who are using S3 Intelligent-Tiering and know their S3 data's access patterns. If you know how frequently your data is going to be accessed, you've got great functionality here to pay less as you store your data long-term. But keep in mind that it's not something we recommend plugging in automatically. We've seen so many companies who are not leveraging S3 Intelligent-Tiering or the S3 Infrequent Access functionality correctly and end up paying more than had they just kept everything on S3 Standard.


Corey: This episode is sponsored in part by ChaosSearch. Now their name isn’t in all caps, so they’re definitely worth talking to. What is ChaosSearch? A scalable log analysis service that lets you add new workloads in minutes, not days or weeks. Click. Boom. Done. ChaosSearch is for you if you’re trying to get a handle on processing multiple terabytes, or more, of log and event data per day, at a disruptive price. One more thing, for those of you that have been down this path of disappointment before, ChaosSearch is a fully managed solution that isn’t playing marketing games when they say “fully managed.” The data lives within your S3 buckets, and that’s really all you have to care about. No managing of servers, but also no data movement. Check them out at chaossearch.io and tell them Corey sent you. Watch for the wince when you say my name. That’s chaossearch.io.


Pete: Right. There are monitoring charges, right? You have to pay for monitoring your files on there. So, when you think about your files on S3, if you have a large number of sizable files, then Intelligent-Tiering can make a lot of sense. You can't even turn it on—I say ‘turn on,’ but it's a storage class, so you would move objects into it.


You would move objects into different classes of S3 storage, but you can't even move objects in there that are below a certain size—I think it's around 128 kB, I can't remember exactly—so there is a minimum size limit to storing objects in there. That's another caveat you have to think about. And Amazon is upfront about this in the docs.


They say, “Listen. Objects in your Archive Access tier are retrieved in three to five hours.” By ‘retrieved,’ they mean moved back to the frequently accessed tier, where they'll have to age out again. That's another caveat you have to think about: if you have things that are constantly moving into those archive tiers and then back into the frequent tier, you then have to wait again for them to not be accessed.


Again, those little things are just caveats. The way a lot of people use S3, though—as a dumping ground for data—this will probably save you money. And you can turn on some of the really great S3 analytics features within Amazon to start analyzing your S3 usage. Look at those reports and figure out the age of your oldest data and how frequently things are being accessed; all this stuff exists within Amazon. You can go and turn that on and—groan—QuickSight. Ugh, QuickSight even has a default—like, it's a data endpoint you can point QuickSight at to run some additional reports. So, again, you can do a lot of this analysis just on your own. It's great. Of course, you can call me or Jesse, and we'll help you do it as well if you don't have the time. But—
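

If you would rather script it than click through QuickSight, storage class analysis itself is one API call. A sketch where the bucket names, configuration ID, and prefix are invented:

```python
import boto3

s3 = boto3.client("s3")

# Sketch only: bucket names, configuration ID, and prefix are placeholders.
# This analyzes access patterns and exports a daily CSV you can inspect.
s3.put_bucket_analytics_configuration(
    Bucket="example-bucket",
    Id="whole-bucket-analysis",
    AnalyticsConfiguration={
        "Id": "whole-bucket-analysis",
        "StorageClassAnalysis": {
            "DataExport": {
                "OutputSchemaVersion": "V_1",
                "Destination": {
                    "S3BucketDestination": {
                        "Format": "CSV",
                        "Bucket": "arn:aws:s3:::example-reports-bucket",
                        "Prefix": "storage-analysis/",
                    }
                },
            }
        },
    },
)
```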


Jesse: Absolutely.


Pete: We're happy to look at that stuff. There are some cool things, though, and I want to talk about that, because you can automatically pay less than $1 per terabyte per month when your objects go into that Deep Archive tier after they haven't been accessed for 180 days or more. That is in-credible. S3 Standard? $23 a terabyte, I think? Something like that, depending on volume, usage, everything else, and any sort of discounts you might have. It's a lot. 20-something dollars a terabyte.


We're talking less than $1 per terabyte. Think about all of those documents you have that just no one's accessed. It's pretty impressive that you can spend so little on that. The next thing that I thought was really great: setup is very easy to do. It has filtering support and it has tagging support—object tags or object prefixes.


I mean, that is, admittedly, also really helpful. You don't have to have stuff in certain locations to tier it out. You could have your application essentially writing things into these certain areas to have them be part of these settings. So, that's a really helpful feature that they add for you.
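

And to put rough numbers on Pete's earlier price comparison: these are approximate us-east-1 list prices from around this time, so treat them as assumptions rather than gospel.

```python
# Assumed, approximate list prices (us-east-1, late 2020), $/GB-month.
standard_per_gb = 0.023
deep_archive_per_gb = 0.00099

GB_PER_TB = 1000  # round numbers for napkin math

print(f"S3 Standard:  ${standard_per_gb * GB_PER_TB:.2f}/TB-month")     # ~$23.00
print(f"Deep Archive: ${deep_archive_per_gb * GB_PER_TB:.2f}/TB-month")  # ~$0.99
print(f"Roughly {standard_per_gb / deep_archive_per_gb:.0f}x cheaper")
```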


And finally, I love the fact that you can actually define the number of days that set the aging-out of this data. It's not fixed at, like, 90 days for Archive and 180 days for Deep Archive; they are tiers that you can enable individually—so you don't even need to use Deep Archive if you don't want to—and you can specify the number of days. So, maybe you say to yourself, “Well, I don't want to use Deep Archive because it doesn't appear to support an expedited request if I want this data back faster—which, again, there's a charge for—so I only want to use the Archive tier.” You can do that.


And maybe you say, “I only want to use the Archive tier when things are over, you know, 120 days, 180 days, 300 days.” Who knows. You can make all those settings yourself. So, it does give you some flexibility there, and you can make it work for your organization and your use case.
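

Putting those last few points together (the opt-in tiers, the configurable day counts, and the prefix and tag filters), here's a sketch of what that configuration looks like through the API; the bucket, configuration ID, and prefix are made up:

```python
import boto3

s3 = boto3.client("s3")

# Sketch only: bucket, ID, and prefix are placeholders. Drop the
# DEEP_ARCHIVE_ACCESS entry entirely if you only want the Archive tier.
s3.put_bucket_intelligent_tiering_configuration(
    Bucket="example-bucket",
    Id="archive-old-reports",
    IntelligentTieringConfiguration={
        "Id": "archive-old-reports",
        "Status": "Enabled",
        "Filter": {"Prefix": "reports/"},
        "Tierings": [
            {"Days": 120, "AccessTier": "ARCHIVE_ACCESS"},
            {"Days": 300, "AccessTier": "DEEP_ARCHIVE_ACCESS"},
        ],
    },
)
```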


Jesse: Absolutely. I think that's a really important thing to highlight: you get to pick if you want to use these additional, cheaper storage tiers. And not only that, you get to pick when objects are transitioned into these storage tiers, which we have not seen as readily available for other S3 storage tiers. So, this is a fantastic opportunity to use these storage tiers if they fit your business use case. I think that's the big asterisk here; whenever you see a commercial for prescription medicine, it always says, “Talk to your doctor to see if this is right for you.” Talk to your teams, talk to the business, and see if the Intelligent-Tiering Archive Access and Deep Archive Access tiers are right for you. They might be; they might not. They're really just additional features that you have the option to use, depending on what your lifecycle policies are for your S3 data.


Pete: Yeah, exactly. And if you have a great relationship with your Amazon account manager, you should send them a message and say, “Hey—”


Jesse: Absolutely.


Pete: “—we're considering this. We've got a lot of buckets—” because let's say you have hundreds of accounts like a lot of people do; it's very easy to create accounts now with AWS Organizations. So, you've got hundreds of accounts. And within those hundreds of accounts, you have hundreds of buckets or more, because there's no longer the hundred-bucket limit. How many people actually remember that one? Do you remember that one, Jesse? Remember the hundred-bucket limit in S3?


Jesse: Oh my God, yes. It had plagued my dreams.


Pete: Yeah. I remember being a very early user of Amazon and asking for that to be increased, and it was the first time I ever got a, “No,” from Amazon. [laugh].


Jesse: Yeah.


Pete: But you have this potential; you could have more than 100 buckets in an account now. You could have countless accounts. How to even begin to understand what you might save from that is a big challenge, and if you are not an expert in analyzing your cost and usage report, or are not working with us at Duckbill, what are your options? Like, how are you going to trawl through all of these enumerations of your S3 usage just to figure out if this would be worthwhile for you? And my answer is usually: go reach out to your account manager. Especially if you have a support contract that you pay real money for, this is something they should absolutely be able to help you out with.


And they can run you a report. They can do some analysis to give you a feel for what you might save. Especially, again, if you're in the petabyte range of S3 usage or greater—multi-petabyte—this is something you definitely want to be looking at, and these additional archive tiers really could have a huge impact on your bottom-line costs. This is just one of the many announcements from AWS Storage Day. Like I said, it's really hard to fit this into one episode, so we're not going to do that.


And we're actually going to split this into the next episode. So, stay tuned for part two of TBD—probably just two. Two of two—where we'll talk about some of the other really fantastic announcements from AWS Storage Day, more cool stuff with S3, some EBS changes, EFS—because that's still around; people are still using EFS—and a whole bunch of new features for FSx. And really, we'll dive into some of those changes, we'll dive into some of the other announcements from that day, and give you our impression on where all this stuff is going. Which is really, really amazing to see just, again, the level of innovation that is coming out, or maybe just the speed at which all this stuff comes out.


All right, well, if you enjoyed this podcast, please go to lastweekinaws.com/review and give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review and still give it a five-star rating on your podcast platform of choice, and tell us how much you love using FSx because you might be one of the, I don't know, two or three people that are actually using it. Thanks again.


Announcer: This has been a HumblePod production. Stay humble.

Fri, 13 Nov 2020 03:00:00 -0800
Why AWS Announces Regions in Advance

Want to give your ears a break and read this as an article? You’re looking for this link.


Sponsors

Never miss an episode



Help the show



What's Corey up to?

Wed, 11 Nov 2020 03:00:00 -0800
The AWS Tea is Hot. Some, calling it Lipton.
AWS Morning Brief for the week of November 9, 2020 with Jam Leomi.
Mon, 09 Nov 2020 03:00:00 -0800
Certifications: The Good, The Bad & The Ugly

Links

  • Follow Last Week In AWS on Twitter

Transcript
Corey: This episode is sponsored in part by Catchpoint. Look, 80 percent of performance and availability issues don’t occur within your application code in your data center itself. It occurs well outside those boundaries, so it’s difficult to understand what’s actually happening. What Catchpoint does is makes it easier for enterprises to detect, identify, and of course, validate how reachable their application is, and of course, how happy their users are. It helps you get visibility into reachability, availability, performance, reliability, and of course, absorbency, because we’ll throw that one in, too. And it’s used by a bunch of interesting companies you may have heard of, like, you know, Google, Verizon, Oracle—but don’t hold that against them—and many more. To learn more, visit www.catchpoint.com, and tell them Corey sent you; wait for the wince.


Pete: Hello, and welcome to the AWS Morning Brief. I am Pete Cheslock. Corey is not here. He's never coming back. No, I'm just kidding, he's just not joining us for the Friday Morning Brief for a little while. Maybe we'll invite him back as a guest, but until then, I'm again joined by Jesse DeRose. Welcome back yet again, Jesse.


Jesse: Thank you so much for having me, I am so happy that Corey has not figured out that we just reset all of his passwords to ‘1234’ and locked him out of everything.


Pete: We did add an exclamation point to the end, and we made it very secure, but I do think it’s the—


Jesse: Very secure.


Pete: —it’s the ultimate troll to essentially take over Corey’s podcast for a period of time—while, of course, he's taking care of his children—and then just invite him back as a guest on it. So, I think that'll be fun. Maybe we'll have to do that: invite him back as a guest on his own podcast.


Jesse: I love it.


Pete: Well, we're here today to talk about a potentially contentious topic: certifications. Are they good, or are they a bag of crap?


Jesse: This is a spicy one. I'm excited for this conversation.


Pete: So, certifications, this is a business that's more profitable for AWS than SimpleDB is.


Jesse: Nailed it.


Pete: Their whole certification ecosystem has really just blown up. I mean, I've been a part of the Amazon ecosystem since nearly the beginning, working for a startup back in the 2009 timeframe; we were very early, and there was no certification, there was no re:Invent. All that stuff came after. And just looking now at the number of certifications that exist: you've got, kind of, your default Cloud Practitioner level, you've got the Solutions Architect Associate level, the Developer level, you've got the Professional level, you can be a DevOps Engineer Professional.


But then, more importantly, they even have these specific specialties in addition, so you can have an advanced networking specialty, or an ML or data analytics specialty. It's really interesting how this has just exploded across the ecosystem, and having been to many re:Invents, I can say they put a good amount of effort into certifying a lot of engineers at those events. But Amazon certifications are actually not the only thing we're talking about today. They're a big part of it, but there are a lot of certifications out there, and for a lot of people, that's how they got into the industry. So, there's potentially a lot of good, but that's not always the case.


Jesse: Yeah. I honestly have a lot of mixed feelings on certifications—strongly mixed feelings. So, what I really want to talk about today is: are they good, or are they crap? Are they things that are ultimately beneficial for you to sit for and take, or are they a waste of your time? And honestly, I think it all really boils down to which certification you're looking at and what you want to do with it. What's the ultimate end goal for getting this certification? Because that can really influence whether or not the certification is going to be worth your time and money.


Pete: Exactly. I mean, what is the point of these to begin with? I mean, other than being just a great cash cow for some businesses?


Jesse: Yeah, I like to think about it like—I compare it to a college degree. I know it's not but I think about it in the same sense of like—


Pete: See, that's a very spicy comparison for some people who have paid lots of money for a college degree—much like myself—to compare it to a certification, but I like where you're going with this, so give it to me.


Jesse: I'm sorry for all the listeners who just dropped off and returned to the latest episode of the Adventure Zone or Serial. I appreciate those of you who are still with us continuing on. For me, a certification can provide a lot of opportunities similar to a college degree, in the sense that it's a way to validate your knowledge. It's a way for you to prove, “Hey, I understand these ideas, these concepts,” that maybe you wouldn't be able to validate otherwise. It validates your knowledge externally, and it gives you the opportunity to show a potential employer, “Hey, I have proven that I am familiar with these topics related to your business, and that is why you should hire me, or why you should consider giving me this promotion or this opportunity.” It really gives candidates an opportunity to derisk themselves: “I have proof. I have third-party-validated proof that I am familiar with these things.”


Pete: Look, Jeff Bezos personally signed—actually I don't know if that's the case. It's probably Andy Jassy—personally signed my certification. So, it's like Andy Jassy is giving me this job recommendation, and Andy Jassy’s stamp of approval.


Jesse: “Do you want us to get Andy Jassy on the phone? Because we can get him on the phone right now, and he can confirm that he personally approved me for this role.”


Pete: Exactly. I mean—and I, of course, say that he stamped mine. So, interestingly enough, I do not have any Amazon certifications, but you do, Jesse.


Jesse: I do. I have the Solutions Architect Associate certification.


Pete: So, at various points in the last couple of companies I've worked at, I have looked at getting an Amazon certification, and honestly, I had the same kind of thought process you just mentioned, which was: what will this give me? What will it give me for my time? And let's leave out the money aspect because, for all the scenarios I'm dealing with, the company was going to pay for it.


Jesse: Sure.


Pete: So, that was less of a risk, but it is my time. I don't want to take it and fail it, and have to take it again; that's just wasteful. So, I’d want to spend some time preparing and reviewing. But at this stage of my career, having been working with Amazon for a long time, if I were to get a Cloud Practitioner or Solutions Architect Associate, does it open up any doors for me? To be honest, at my stage of my career, it doesn't really help me that much. Now, it does help some people—and this is actually something I wanted to talk about—because when you work for a company that is an Amazon Partner, to be at a certain level, you have to have a certain number of people certified; that's the whole point of it. If you're kind of working on behalf of Amazon, they want to make sure that you have a certain number of people certified at a certain level, internally. And again, it's that validation. It's proving that you have enough people who understand Amazon to deliver, maybe, a piece of software or a service on top of their platform. Again, it's that external validation. And in the same way that having a certification derisks you as a candidate, all of those certified engineers within your organization essentially derisk the organization, from both the consumer of your service's side as well as from the Amazon side. For anyone who's been to re:Invent, you'll walk around the booths and you'll see Amazon Partner Network Select or Platinum or Mega Ultra Platinum, or whatever is on the Datadog booth—that they got Emerald level. Those types of messages make a consumer of those services feel a little more comfortable dedicating some part of their business to them.


Jesse: Yeah, it really makes the consumer comfortable knowing that these people know what they're doing. I am more comfortable giving this vendor my money because I know that they are certified in some capacity. It's the same way if you think about it outside of the tech world, if you look at a company that is certified by the Better Business Bureau, or certified with—any restaurant will throw a, “Our customers love us on Yelp,” sticker up in their window to show, we get a 4-star rating or higher, or a 4.5-star rating or higher, or whatever, on Yelp. It really markets the company better to show, “Hey, you should invest your money with us,” whether it's a restaurant, whether it's a business, whether it’s—whatever the company is.


Pete: Exactly. And so let's talk about the flip side, because there's definitely a lot of—me getting a certification just to help my business is not, again, providing me anything. Maybe it's providing my business something. But for a lot of people, myself included, certifications are actually how we got into this industry. I, quote, “knew computers,” in the early aughts when I was in college, and I started working for an internet service provider, which is, like, the OG SaaS company—I mean, before SaaS companies existed; we were an internet service provider and we were a hosting company. And I literally went to a bookstore, you know, like a physical place that sells books—I know it's hard to imagine in this world, but you could walk in, and you could walk out with a physical book—and I bought the A Plus certification book, and it was just—


Jesse: That book is a brick.


Pete: Three inches thick at least, this thing is. And I also bought the Network Plus book, and I took them home, read through them, and took the certifications. And those certifications essentially got me into this company. They let me elevate from being a support engineer taking calls and fixing people's internet to going out and installing T1 lines and DSL, and then moving to the data center side and becoming a network engineer. And those certifications, I think, really gave me a great base of knowledge when I was coming from zero.


So, this is one of those things that always feels so frustrating to me, and I get a little annoyed when people are like, “All certifications are garbage.” And I'm like, “Many certifications are, but many really lift a whole group of people up and out of this funk they might be in where they just don't have this expertise.”


Jesse: This is the one big selling point for certifications for me. If you are looking to get into a new field, or if you're looking to explore deeper within an existing field where you maybe already have a little bit of surface knowledge, a certification will absolutely give you some academic learning to go out and apply in the world.


Pete: Yeah. Now, granted, the A Plus certification—I can't speak to how it is today. Again, it's probably a great cash cow for the company that operates it, with the test prep and the book-selling and things like that, but it did have a lot of useful information that I absolutely still remember. The other certification I remember taking—this one was a little bit later in my career—was the CCNA. This was when I was working for a company doing consulting, a lot of Cisco networking, PIXes and ASAs, for the Cisco folks out there who remember those; maybe they're still around, I don't know.


And I remember taking that CCNA and, sure, there were a lot of Cisco-specific things, like certain commands, but a big part of the CCNA was how to subnet—like, how to subnet by hand. Wildly helpful. I used it all the time for years as a network engineer, and then everyone went to the Cloud, and it was like, “Sweet, I don't have to deal with subnetting anymore.” And then VPCs came out, and suddenly it's, “You've got to start subnetting again.” I feel like there's a whole generation of people that didn't need to learn subnetting. [laugh].
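

For anyone in the generation that skipped subnetting, the kind of exercise Pete means is carving one network into smaller ones. A quick sketch using nothing but Python's standard library:

```python
import ipaddress

# Split a /24 into four /26 subnets -- the sort of thing the CCNA
# makes you do by hand.
network = ipaddress.ip_network("10.0.0.0/24")
for subnet in network.subnets(new_prefix=26):
    # num_addresses counts the network and broadcast addresses too
    print(subnet, "usable hosts:", subnet.num_addresses - 2)
```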


Jesse: It's like the computer science students who went to college and learned how to code the list function themselves before they learned that the list function already existed—to give you those underlying building blocks, those underlying principles that you can then apply to other parts of the work you're doing. So, maybe you don't end up needing to create something from scratch, but you have that underlying knowledge that allows you to then layer newer knowledge about the Cloud, about AWS, about technology, on top of it.


This episode is sponsored by our friends at New Relic. Look, you’ve got a complex architecture because they’re all complicated. Monitoring it takes a dozen different tools. Troubleshooting means jumping between all those dashboards and various silos. New Relic wants to change that, and they’re doing the right things. They’re giving you one user and a hundred gigabytes a month, completely free. Take the time to check them out at newrelic.com, where they’ve done away with almost everything that we used to hate about New Relic. Once again, that’s newrelic.com.


Pete: I think it's really good to point out, too, that the company I worked at—where they, you know, not forced me, but strongly encouraged me to get the CCNA—also needed a certain number of CCNAs, and CCNPs, I want to say. Obviously, there's the CCIE, which none of us at that company had. But they needed a certain number of those engineers to reach certain partner tiers within Cisco as a Cisco reseller.


Jesse: Yeah, I think one of the other things to note here is that I worked for a company a number of years ago that specifically incentivized employees to get their AWS certifications. Part of it was an opportunity to get everybody up to a certain speed, everybody on the same page, but part of it was also security: when compliance time rolled around, we wanted to be able to say we had this many people who understood what they were doing with our cloud services.


Pete: Yeah, again, it's derisking. It's making everyone feel a little bit better that they have engineers—employees within the business—who have taken this test and validated that they have this knowledge. Now, here's a question for you, Jesse: should I go get a cert? I mean, I've looked at the practice tests for Cloud Practitioner—that's the first level of the AWS certs—and it calls for six months of fundamental AWS Cloud experience. I looked at the preview questions; seems pretty easy, so let's assume I go and take that and pass it.


But then do I go down the path for Solutions Architect? Do you think that certification gave you value? Was there something you fundamentally learned? Like, in the CCNA, I learned subnetting that I still use today. Do you feel like there was something similar within the Amazon certifications you've taken? Or do you think there might be, even at the higher Professional level?


Jesse: I think there is probably much greater knowledge at the higher Professional level than at the entry-level Associate certification level. For me, I already had a number of years of experience with AWS going into this exam, so I felt relatively comfortable with all of the basic building blocks the certification covered. The hard part for me was the way the certification discussed the various topics. Specifically, AWS wants you to do things the AWS way, so they always want you to use AWS services as a solution rather than building a solution yourself, or bringing a third-party solution into the mix through something like a third-party vendor—if you need an observability tool or something.


I feel like there is a massive disconnect between the theory and the actual practice of some of the content in these certifications. Specifically in my case, the theory for the AWS Solutions Architect Associate certification said, “For everything you want to do, do it with an AWS service.” But in the real world, that's not how any company runs—even the companies who are cloud-native, even the companies who want to use as much of AWS, or their cloud provider, as possible. That is not what they're going to do; they're not going to use AWS services for every solution. There's a cost-benefit trade-off to consider, and I think that's something that at least this particular AWS certification—and, I would assume, other certifications—doesn't think about.


Pete: Yeah, I think that's a really good point, too. And obviously it makes sense: they're going to optimize and build their tests for a business that consumes the full suite of Amazon services. And to your point, not everyone's going to do that. They're going to have different architectural reasons not to. They might fear the great wrath of vendor lock-in and not want to use EMR or something, or not want to use Amazon Elasticsearch because they want to run it themselves. Lots of different scenarios like that.


I think the other thing, too, that I would be super curious about—if anyone has taken the specialty-level Amazon certifications—is how valuable people think those are. I’d be curious what the advanced networking one looked like, or the security one. Are those generic enough that you'll learn about cloud security by taking the security one, or learn some advanced networking from the other? Obviously, there'll be an Amazon spin, but networking in Amazon is not easy; especially if you have a hybrid-cloud scenario and you want to bring in Direct Connect and various peering locations you're connecting to, the complexity starts getting off the charts really fast.


Jesse: Yeah. If you are looking for some fundamental knowledge for one of those areas, for one of these specialty certifications, or even one of the non-specialty certifications within AWS, it's worth going through some of the practice material. Try a couple of the practice exams, try a couple of the questions, maybe go through some of the free content on one of the test prep websites. See how you do. See if you get all the questions, see if you get none of the questions, and then gauge how much you might learn from there and see, is this ultimately worth your time?


Pete: Yeah. And that's the question you really have to ask yourself. I would definitely say this applies if you're early in your career—and when I say early, I mean: do you think you're going to be working with Amazon services for the next three to five years? Or maybe you're not yet working with Amazon services and you're trying to break into that; maybe you're working in a traditional data center mode. Both of those types of individuals, I think, should absolutely go down the path of getting a certification.


I think it's a no-brainer if your employer will pay for it. And you should definitely ask them if they'll pay for it, because then you're obviously not having to spend your own money; it's really just your own time. And they should pay for it: they'll get value out of it, they might be an Amazon Partner, they might be trying to elevate their level. But it's also your own personal growth, and it can provide value to many people out there.


Jesse: Absolutely. I think that any company that wants their employee to get one of these certifications should pay for it, because they're ultimately showing the employee that they value them as a person and as an individual, that they want them to grow, and that they want to invest in their further career development by giving them access to these resources to study and take the exam.


Pete: And finally, if you actually want to list your own personal consulting services on the AWS IQ service, you need to be certified. All of the consultants who can list on AWS IQ are certified, so that's just kind of table stakes from there.


Jesse: Yeah, having a certification can be table stakes for a lot of opportunities in the same way that a college degree is table stakes for a lot of opportunities. Showing that you have spent the time to validate your learning and validate your knowledge is absolutely worthwhile if you are looking for an opportunity that requires a degree, or some kind of certification, or some level of, you know, three years of experience. Something that says: we want you to prove to us that you have some kind of experience with this topic, with this information.


Pete: Yeah, it's third-party validation, largely: this third party has said, “Yeah, this person knows what they need to know to meet this level.” And sadly, a lot of businesses still do that checkbox hiring. If there are two candidates, and they both have a couple of years of experience, my guess is the candidate who has those relevant certifications will, sadly, be rated a lot higher than the one who doesn't. That's just the sad state of the checkbox-hiring world we live in. But to be honest, a lot of that happens, too, with college degrees: two candidates with similar backgrounds, one with a college degree and one without, and my guess is most companies are probably going to go for the one with the college degree. So, given that, again, that third-party validation could be the difference between a new role and not.


Awesome. Well, Jesse, thank you again, for joining me, I think certifications are always a really interesting topic. It's great to kind of talk about both sides of it. The good, the bad, the ugly, and the profitable for AWS.


Jesse: [laugh]. Absolutely. I think this is a really important topic that everybody should be mindful of and really ask themselves if they're looking at getting a certification, is it worth your time?


Pete: And one certification that is absent from here, I'm noticing: they have Alexa Skill Builder. Great certification course, you definitely need that one. But what they don't have is any certification specific to cost optimization, cost analysis—


Jesse: Ooh.


Pete: —nothing related to that. So, I'm just calling that out to any of the Amazon folks who listen in and haven't sent us hate mail yet: I am very curious why there is not a certification for, essentially, cloud economics, so let us know. I would personally love to know if that one is in progress. I would love to look at anything related to it because that is actually the one certification that I think Amazon needs to have. I think more people need to understand the economics of cloud, and a certification might actually help people finally go learn that.


Jesse: Cost optimization is one of the pillars of the AWS Well-Architected Framework, so I definitely would love to see a certification on this.


Pete: All right. Well, we're putting it out there, Amazon. Show us what you got. So, awesome. Thanks again, Jesse. Really appreciate it.


If you enjoyed this podcast, please go to lastweekinaws.com/review and give us a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review and give it a five-star rating on your podcast platform of choice, and tell us how many certs that you have, but more importantly, how many of those are still valid. Thanks again.


Announcer: This has been a HumblePod production. Stay humble.

Fri, 06 Nov 2020 03:00:00 -0800
The Other Side of Paternity Leave

Want to give your ears a break and read this as an article? You’re looking for this link.


Sponsors

Never miss an episode



Help the show



What's Corey up to?

Wed, 04 Nov 2020 03:00:00 -0800
Did He Put Your Million Dollar Check In Someone Else's Box
AWS Morning Brief for the week of November 2, 2020 with Courtney Wilburn.
Mon, 02 Nov 2020 03:00:00 -0800
Blinded by QuickSight

Links

Transcript

Corey: This episode is sponsored in part by Catchpoint. Look, 80 percent of performance and availability issues don’t occur within your application code in your data center itself. They occur well outside those boundaries, so it’s difficult to understand what’s actually happening. What Catchpoint does is make it easier for enterprises to detect, identify, and of course, validate how reachable their application is, and of course, how happy their users are. It helps you get visibility into reachability, availability, performance, reliability, and of course, absorbency, because we’ll throw that one in, too. And it’s used by a bunch of interesting companies you may have heard of, like, you know, Google, Verizon, Oracle—but don’t hold that against them—and many more. To learn more, visit www.catchpoint.com, and tell them Corey sent you; wait for the wince.


Pete: Hello, and welcome to the AWS Morning Brief. I am Pete Cheslock. I'm still here. I'm going to be here for a while I guess, but not alone. I'm here with Jesse. Jesse, thank you again for coming on board and keeping me company.


Jesse: Always a pleasure.


Pete: It's honestly just nice to talk to someone else that's outside of my little family unit or my pandemic crew.


Jesse: I would say it's nice to get paid to just talk about my feelings. But I mean, I'm not technically getting paid for this.


Pete: Yeah, I feel like I'm just trying to balance the conversations with coworkers and podcasting this; my kids, at this point, have more Zooms than I do.


Jesse: [laugh]. I think that probably says something about our social lives and about ourselves. And I feel like I need to go rethink everything.


Pete: Well, my son who is six years old, he does a better job of managing his mute button than most full-grown adults I know.


Jesse: I feel like that's the fun thing. I really want to see how the next generation is going to grow up with technology, better understanding the mute button, and all of this video content than we do.


Pete: It is hilarious to hear my daughter yelling at her friends, “You're on mute.” [laugh]. Oh, well, what is not on mute today is both of us. We are talking about the most loved Amazon service, Amazon QuickSight.


Jesse: I think it's technically going to be on blast today rather than on mute.


Pete: Yeah, I think we're going to struggle to keep this one on time. So, if we go long, I apologize in advance. But we're talking about QuickSight, which for those that maybe have never heard of QuickSight before, it's Amazon's business intelligence tool. The question you're probably asking yourself, to be perfectly honest, is why? Why did you even try QuickSight?


Like, what problem were you solving that made you think of QuickSight? So, we're going to tell that story. But first, let's just pivot into BI tools: business intelligence tools. That's the category that QuickSight is technically in. So, we'll talk a little bit about that, and also how we actually use BI tools within Duckbill, because that'll give you, hopefully, the context to answer that question of, “Why did you even try QuickSight, Pete? Why?”


Jesse: I mean, I feel like there's probably still going to be people asking us why after this podcast, and I'm sorry for those listeners. We don't have an answer for you. Maybe we're just masochists. We don't know.


Pete: It's just because it's there, I think is what the final answer is. [laugh].


Jesse: Absolutely. So, business intelligence tools solve a whole variety of problems and we could probably do an entire episode on them in general. They help you gain insights from your data, which is fantastic. I absolutely love that this is even a category of service out there. But today specifically, to keep it on track, we want to specifically talk about gaining insights from your spend data, your AWS spend data. And to do that, we really need to start by talking about the AWS Cost and Usage Report.


Pete: Yeah, the Cost and Usage Report—you might hear it referred to as the CUR. I heard it referred to as the CUR often and it took me quite a while to actually figure out what anyone was talking about. So, if you hear someone say the CUR, they probably mean the Cost and Usage Report. But this is the v2, we'll call it, version of the Amazon billing data.


It's incredibly high fidelity, I think is the term. It's very granular; there's a lot of data in there. And it's not enabled by default; you need to actually go turn it on. But what's awesome about this tool is it can provide you some really deep insight into where your money is going, and the only cost for it is the cost to store the data. And when you turn this report on and have it dumped into your S3 bucket location of choice, you can actually have it stored in a couple of different file formats.
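(For reference, here is roughly what turning the CUR on looks like through the API. This is a sketch, not gospel: the report name, bucket, and prefix are placeholders, and the bucket needs a policy that lets billingreports.amazonaws.com write to it.)

import boto3

# The CUR API only lives in us-east-1.
cur = boto3.client("cur", region_name="us-east-1")

cur.put_report_definition(
    ReportDefinition={
        "ReportName": "hourly-cur",                  # placeholder name
        "TimeUnit": "HOURLY",
        "Format": "Parquet",                         # columnar, far cheaper to query
        "Compression": "Parquet",
        "AdditionalSchemaElements": ["RESOURCES"],   # include resource IDs
        "S3Bucket": "my-cur-bucket",                 # placeholder bucket
        "S3Prefix": "cur/",
        "S3Region": "us-east-1",
        "AdditionalArtifacts": ["ATHENA"],           # emit Athena-friendly output
        "RefreshClosedReports": True,
        "ReportVersioning": "OVERWRITE_REPORT",      # required for the Athena artifact
    }
)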


One of them is an Excel-friendly CSV format. And the other one is Parquet, which is a columnar data format and a lot more efficient for this type of data. And it's the Parquet version of this that we use; we tell our clients—clients of Duckbill—to turn this on, and turn it on with Parquet, because then you can use tools like Athena to query your data right where it sits in S3 and run ad hoc queries. Athena, though—and we're not really here to talk about Athena—is challenging to use, in some cases—


Jesse: Yeah.


Pete: —you have to know SQL, and if you don't know SQL, you're kind of in a bad spot. So, we use a BI tool, a very popular one called Tableau, to query our data through Athena. So, Athena is kind of the engine; you could also, obviously, put your CUR data into an actual database. But largely, the queries we're doing are all human-generated. We're fine if they take seconds; they don't need to happen in milliseconds.
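(If you are curious what those ad hoc queries look like, here is a hedged sketch of one run through the Athena API. The database name, table name, and results bucket are assumptions; yours come from however you wired up the Athena integration for your CUR.)

import boto3

athena = boto3.client("athena")

# Top ten services by unblended cost for October 2020. The "cur_db" database
# and "cur_table" table names are placeholders.
query = """
SELECT line_item_product_code,
       SUM(line_item_unblended_cost) AS cost
FROM cur_db.cur_table
WHERE year = '2020' AND month = '10'
GROUP BY line_item_product_code
ORDER BY cost DESC
LIMIT 10;
"""

resp = athena.start_query_execution(
    QueryString=query,
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder
)

# Poll get_query_execution / get_query_results with this ID to fetch rows.
print(resp["QueryExecutionId"])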


Jesse: Yeah, I mean, there's lots of solutions out there. There's third party commercial apps like Tableau and Looker—RIP—there's open-source options like Metabase. But of course then, in true AWS fashion, there's also a hastily integrated acquisition called QuickSight.


Pete: So, I have this memory in my head—and hopefully someone will correct me if they're listening and I'm wrong here—but I feel like QuickSight was actually an acquisition. Amazon really doesn't usually acquire a lot of teams or businesses into Amazon Web Services, with a couple of pretty rare exceptions, but I'm almost positive QuickSight was some other product that Amazon acquired. In any case, the history of QuickSight, at least under the Amazon umbrella, started around 2015, when they announced it at re:Invent, and I was there for that announcement. I remember that announcement clearly, and I still kind of laugh at it. And not for nothing, it does not look like it has gotten much better in the five years it's been operating since launch.


But I do want to tell one story about that announcement at re:Invent that year, because there was this common, I guess, trope of re:Invent. Amazon would announce a service, and people would play this drinking game, essentially—because that's what re:Invent mostly is, one week-long drinking game—where they would walk the floor, you know, the expo hall of all the vendors there. And every time Amazon announced some new service, people would point out, “Oh, that company is going out of business. And that company is going out of business.” And I so clearly remember when they announced QuickSight: you could hear people pointing to the Tableau booth—because Tableau obviously was there—pointing to the Looker booth, and saying, “Ah, those companies? Gone.”


Jesse: Gone.


Pete: Vaporized, right? Amazon's going to stomp all over them. And then I'm reminded of last summer's Tableau acquisition by Salesforce: that was $15.7 billion. I'm also reminded of the Google acquisition of Looker this year: $2.6 billion. So, meanwhile, QuickSight is still QuickSight. It's still there.


Jesse: Tableau and Looker are clearly doing well for themselves. And QuickSight is still QuickSight.


Pete: QuickSight’s still QuickSight. Now, this all goes back to, again, why did we even look at QuickSight?


Jesse: Yeah, we started with Tableau for our business intelligence use case because that's what we had collectively used at previous companies. But there's arguably a case for each of these solutions, depending on your use case. So, we just started with Tableau because that's what we were familiar with. But we wanted to try QuickSight to see if it would solve our use case the same as Tableau. We wanted to really get that apples-to-apples comparison, so to speak, to see if using QuickSight might get rid of some of the negative things we experienced with Tableau, while maybe adding some positives of its own: removing Tableau and moving all of our business intelligence work for our spend data into AWS itself, keeping it all collectively in the same ecosystem. So, we wanted to try QuickSight to see if it would solve our use case. And I will let Pete talk more about that once his eye stops twitching.


Pete: Yeah, it was quite an experience, that's for sure. Now, if you've never used Tableau, Tableau is… oh, how can I describe it? It's like a box of Legos. But not a box of Legos where you follow this really clear booklet and end up with the Star Destroyer from Star Wars. Or Star Trek. I don't know, one of those star movies. I'm angering so many people right now. But—is it even called the Star Destroyer? Yeah, it's a Star Destroyer, right?


Jesse: Yeah.


Pete: [laugh]. So, it's not like that. It's just a box of random Legos that you have to piece together yourself. Now, it's unbelievably powerful because you can build anything you want with that box of Legos, just like Tableau. You can build any sort of visualization.


It is so insanely powerful; it can connect to a ridiculous number of different back-end interfaces, from JSON files to geospatial data. I mean, it's got everything; it's really full-featured. And that's great. We can do a lot with it. It's also extremely well documented, and that's super helpful as a Tableau beginner trying to dive into some of the Cost and Usage Report data.


But it's heavy. There's just a lot to it. And so we really wanted to see: is there something simpler? I know a lot of people use Looker, and they love it. Honestly, the main reason we passed on Looker was the Google acquisition; it did not help. Something about using a Google-owned product to help our clients analyze their Amazon billing data just felt not great to us. And also, we don't trust Google, and we're pretty sure that sometime in the next year, they're just going to quit Looker and shut it down. Because that's the thing that happens.


Jesse: Yeah.


Pete: So, we went to QuickSight because we thought, “Well, hey, if there's something that's integrated, that's native to Amazon, then that could potentially be a really helpful way for us to maybe build some standard dashboards and templates that we can easily share with our clients and make available.” Like, that could be really cool. So, that's where we started. So, I went, and I actually signed up for QuickSight, and that's where I kind of ran into the initial thing. It's not like a normal Amazon service.


It's not like—there's a little bit of pay-as-you-go, but it's pay per user. That's how it's billed out. And so you kind of end up on a landing page asking, “Do you want to sign up for this?” But you do get a free trial—I think it's 90 days, if I'm remembering right—so there's not really a lot of risk here. And also, it's not super expensive; I think it's tens of dollars per user, compared to the cost of, like, a Tableau, which is much, much more expensive.


So, all those reasons added up to: this seems like it should be good, and this should work out. Like, we should definitely dive into this one. So, I am a classic engineer, and I don't read directions before doing things. I really shouldn't have to; it should be that simple to set up. And I went through and clicked through on QuickSight, set up Standard Edition because I didn't need all the enterprise-y features, and started trying to build a dashboard.


And honestly, other than some complexity getting access set up, which I think was mostly IAM-related, it actually was pretty easy to get up and running. And granted, my Athena database was already set up with my CUR data, so all the complexity of that was taken care of. It really was just: go to QuickSight, add a data source, point it at Athena, and I was able to run some queries. And you know what? Within a really short amount of time, I was able to build some dashboards, using some of the documentation that exists. So, that's, I think, the one good thing we ran into with QuickSight. It was quick… question mark? [laugh].
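(That “add a data source, point it to Athena” step, roughly, via the API. A sketch with a made-up account ID, assuming the default “primary” Athena workgroup.)

import boto3

qs = boto3.client("quicksight")

qs.create_data_source(
    AwsAccountId="111122223333",       # placeholder account ID
    DataSourceId="cur-athena",         # placeholder ID
    Name="CUR via Athena",
    Type="ATHENA",
    DataSourceParameters={"AthenaParameters": {"WorkGroup": "primary"}},
)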


Corey: This episode is sponsored in part by our good friends over at ChaosSearch, which is a fully managed log analytics platform that leverages your S3 buckets as a data store with no further data movement required. Whether you're looking to process multiple terabytes up to petabyte-scale of data a day or a few hundred gigabytes, this is still economical and worth looking into. You don't have to manage Elasticsearch yourself. If your ELK stack is falling over, take a look at using ChaosSearch for log analytics. Now, if you do a direct cost comparison, you're going to save 70 to 80 percent on the infrastructure costs, which does not include the actual expense of paying infrastructure people to mess around with running Elasticsearch themselves. You can take it from me or you can take it from many of their happy customers, but visit chaossearch.io today to learn more.


Jesse: [laugh]. Yeah, it was great to initially set that up and see that, out of the box, there was a minimal learning curve to get from zero to dashboard. And I really appreciated that. That early win gave me false hope that this was going to be this amazing product that would solve all of our problems, which, joke's on me, because—


Pete: Lies.


Jesse: —yeah, there’s—


Pete: Lies.


Jesse: —absolutely. And so I started playing around with it, and I wanted to see how easily we could recreate some of the dashboards that we use in Tableau right now. For our listeners' sake: we create dashboards in Tableau that we ultimately want multiple different end-users to be able to view. Effectively, we want to be able to copy and paste a template of dashboards that different users will be able to view with different data sources, depending on who's looking at it, and really make this as templated, as cookie-cutter, as possible. That's not the QuickSight way. That's not how they want you to do things. I built a—


Pete: It’s not quick. That part is not quick. It's like: I create a series of, I think they're called analyses, or dashboards, and I might want to change the data source from A to B. The same data, but I want to point at a different client, right?


Jesse: Yep.


Pete: That is hard in a way that I don't understand why.


Jesse: Yeah, it's this whole process: open the existing dashboard that you've already built, duplicate it as a new analysis, and then re-export or re-save it as a new dashboard. Why? Why? Why can't you just change the underlying data source for the existing dashboard?
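(To be fair, the QuickSight API, as opposed to the console, does expose a template mechanism that gets closer to this: you publish an analysis as a template with data set placeholders, then stamp out dashboards that bind those placeholders to different data sets. A rough sketch follows; every ID and ARN in it is made up.)

import boto3

qs = boto3.client("quicksight")
account = "111122223333"  # placeholder account ID

# Publish an existing analysis as a reusable template with a data set placeholder.
qs.create_template(
    AwsAccountId=account,
    TemplateId="cost-dashboard-template",
    SourceEntity={
        "SourceAnalysis": {
            "Arn": f"arn:aws:quicksight:us-east-1:{account}:analysis/cost-analysis",
            "DataSetReferences": [{
                "DataSetPlaceholder": "CUR",
                "DataSetArn": f"arn:aws:quicksight:us-east-1:{account}:dataset/client-a",
            }],
        }
    },
)

# Stamp out a dashboard bound to a different client's data set.
qs.create_dashboard(
    AwsAccountId=account,
    DashboardId="client-b-costs",
    Name="Client B costs",
    SourceEntity={
        "SourceTemplate": {
            "Arn": f"arn:aws:quicksight:us-east-1:{account}:template/cost-dashboard-template",
            "DataSetReferences": [{
                "DataSetPlaceholder": "CUR",
                "DataSetArn": f"arn:aws:quicksight:us-east-1:{account}:dataset/client-b",
            }],
        }
    },
)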


Pete: Yeah, exactly. And that's something that we do quite a bit with our Tableau. I mean, we can have dozens of Tableau reports and Tableau dashboards, make one change to the underlying data source, and everything will update. Again, the data is the same. So, the queries are the same.


It's the same SQL query, the same type of queries we're running; we just want to point it at a different data set. And yeah, that was super hard, and honestly, I didn't really get a satisfying answer on how to do that without having to rebuild my dashboards. Now, I did find something pretty interesting that will hopefully answer that problem. There is a website that is Amazon-run, but looks like it's hosted on GitHub, called wellarchitectedlabs.com.


And I hadn't seen the site before. I found it while doing some research into QuickSight, trying to answer some of these how-to questions, and I found this really great lab for QuickSight on creating cost intelligence dashboards. And what I thought was really fascinating, and wanted to try, was this concept of, “Hey, you can go and get access to these templates and load them into your QuickSight.” And I'm just thinking to myself, “That's what we want. That's what we want to build. Let me go and test out how that works.” So, this is where I ran into a new problem with QuickSight. I went through this lab, and one of the early steps was, “You need to make sure you have QuickSight Enterprise Edition for this to work.” So, I just said, “Okay, great. I'll just go and upgrade my QuickSight account from—


Jesse: What could go wrong?


Pete: —Standard to Enterprise, because, I'm still—what could go wrong, right? And I looked at a couple of docs, and it was basically, “This is how you upgrade. And oh, by the way, when you upgrade to Enterprise, you can't downgrade.” And I think the docs even say something like, “You have to cancel your QuickSight subscription, and then maybe start a new Standard Edition.” Which I thought was kind of hilarious. [laugh].


Jesse: That's odd.


Pete: It just feels very, I don't know, it's like, very wrong, very non-AWS is what it feels like.


Jesse: Yeah.


Pete: Anyway, I upgraded to Enterprise, and every one of my dashboards and queries immediately failed. Just, nothing worked anymore. It was very sad.


Jesse: Very sad, and very, very frustrating.


Pete: So, I opened a support ticket, because that's what we do. Right? I actually got a response back pretty quick. And we hopped on a screen share, so I could kind of go through it. Again, the only change I made: upgrade to enterprise.


This feels like something that should just work. But after, I don't know, maybe a week or so, we finally got a response back on what the issue appears to be—although I'm not sure I totally trust it. It has to do with something IAM-related: I am accessing this account as a federated user with role-based access, because that's just how we access all of our accounts—we have users that exist in a separate account from the account where I'm running QuickSight. That feels like a pretty standard use case to me, but the support ticket is basically saying that may not work. So, I don't totally know, and we're still working with Amazon support to solve this problem, but if federated access is not supported, we would love to see it on the QuickSight roadmap.


Jesse: Yes, please. Absolutely. And it also feels very AWS that they said we need to use normal IAM user federation instead of a federated user. Like, wait, what? Those things are the same in my mind, so at this point, we just seem to be arguing semantics.


Pete: Yeah, I don't totally understand it, but I'm going to try to be the good person, and I'm going to follow the steps given to me by support, mostly just to see: does it work that way? If it does work that way, I will probably laugh. I will then probably cry a little bit.


Jesse: [laugh].


Pete: But at least then I'll be able to finish this lab so I can just see what I came to see. I want to see like, there clearly is a way to do this templating thing and point things at different data sources. What does that look like? That's all I really… that's all I really want in this world.


Jesse: And I will say, to QuickSight's defense, we did get really fast response times for all of our queries, which is a definite plus, a huge gain over Tableau and the other tools we were testing before. So, there are a couple of things that QuickSight does well, but that list is kind of shorter than I'd like it to be.


Pete: Yeah, I agree. I also did not test one of the additional features of QuickSight, which probably would change my feeling on it, maybe a little bit. They have an in-memory query engine they call SPICE, which is short for Super-fast, Parallel, In-memory Calculation Engine. There's an additional charge for usage there.


I haven't turned that on. Cost and Usage Report data, even if you're a heavy user, even if you've turned on, like, resource IDs, is not that big, and I don't need my queries to be that fast. Even if I make some of the gnarliest dashboards showing usage down to the specific resource ID—which is an extremely high-cardinality field; there's a line for every single resource you use, every EBS volume, every EC2 instance, every Lambda; just tons and tons of resource IDs—those queries still worked. Maybe one would take 30 seconds or something? It's fine, right? I don't need a millisecond query. So, it's possible that maybe that's the game-changer technology. But we weren't even able to get that far. [laugh].


Jesse: Then it would really be true that the SPICE must flow in order for the rest of the universe to work as we want it to.


Pete: I hope there was a Dune fan who came up with that. Like, they came up with the term ‘SPICE’ first as some internal Dune-related thing, and then had to back into what SPICE could stand for. Because I think it's the ‘super-fast’ part that makes it feel like they backed into the acronym. [laugh].


Jesse: Yeah, and saying that the SFPICE will flow just doesn't have the same ring to it.


Pete: [laugh]. Oh, wow, that was dad joke level bad there, so I think—


Jesse: You’re welcome.


Pete: —I think on that note, is where we will put an end to QuickSight for today, at least. Maybe we will come back to this. But I'm really curious. Does anyone out there use QuickSight? Do you like it? Is it a necessary evil for you? Was it just there and you're stuck with it now? I'm super curious.


These BI tools, there's a lot of them out there. I feel like they're getting more popular. More and more people are using these tools to overlay business metrics with usage metrics and cost metrics and things like that, so I'd love to hear from people who actually use this. And I hope someone out there can just tell me that like, “No, no, no. You're just doing it wrong. You're just using it incorrectly, and if you use it in this other way, it's actually really great.” So, I'm going to wait to see if anyone actually does that.


Jesse: To go back to our previous point, like, I do feel like this is therapy for me because we get to talk about things that are broken, and then eventually somebody on the internet will correct me and say, “No, no. Why aren't you doing it this way instead?” And in some cases, they're right, and I do it the other way and all of a sudden everything works the way that I want it to.


Pete: Exactly. So, if you're out there and you're using QuickSight, please give us your feedback. I would be very curious to learn more, and hopefully, I can make it through now that I have a fix for my support ticket. I'm going to continue on at the wellarchitectedlabs.com site, go through the rest of this lab, and give it a swing. And then maybe I'll come back next time and say I was wrong, and here's the reason why. But thus far, I think QuickSight is eh, eh, it's fine. Like, I give it a solid meh.


All right, and with that, if you enjoyed this podcast, please go to lastweekinaws.com/review and give it a five-star review on your podcast platform of choice, whereas if you hated this podcast—like your hatred for QuickSight—please go to lastweekinaws.com/review, give it a five-star review on your podcast platform of choice, and just tell us all the terrible ways that you're using QuickSight to get an answer to your data. Thanks again.


Announcer: This has been a HumblePod production. Stay humble.

Fri, 30 Oct 2020 03:00:00 -0700
Reader Mailbag: Savings Plans (AMB Extras)

Want to give your ears a break and read this as an article? You’re looking for this link: https://www.lastweekinaws.com/blog/reader-mailbag-savings-plans



Sponsors



Never miss an episode



Help the show



What's Corey up to?

Wed, 28 Oct 2020 03:00:00 -0700
Not Throwing Away My Shot!
AWS Morning Brief for the week of October 26, 2020 with Ceora Ford.
Mon, 26 Oct 2020 03:00:00 -0700
Best and Worst Ways to Incentivize Teams

Links

Transcript

Corey: This episode is sponsored in part by Catchpoint. Look, 80 percent of performance and availability issues don’t occur within your application code in your data center itself. They occur well outside those boundaries, so it’s difficult to understand what’s actually happening. What Catchpoint does is make it easier for enterprises to detect, identify, and of course, validate how reachable their application is, and of course, how happy their users are. It helps you get visibility into reachability, availability, performance, reliability, and of course, absorbency, because we’ll throw that one in, too. And it’s used by a bunch of interesting companies you may have heard of, like, you know, Google, Verizon, Oracle—but don’t hold that against them—and many more. To learn more, visit www.catchpoint.com, and tell them Corey sent you; wait for the wince.



Pete: Hello, and welcome to AWS Morning Brief. I’m Pete Cheslock. I'm still here; Corey is still not. I'm sorry. But don't worry, I'm here again with Jesse DeRose. Welcome back yet again, Jesse.



Jesse: Thank you for having me back. I have to say for all our listeners, I'm sorry I have not watched the entire Step Up trilogy and all the other breakdancing movies we talked about last time. It is still on my todo list. But fear not, it will happen. We will talk about this again.



Pete: Well, that actually brings up a really good point, which is we need to make a correction from our last podcast. We talked about how Breakin' 2: Electric Boogaloo was the sequel to Breakin’, and I had incorrectly thought that Breakin’—the first one—also had ‘Electric Boogaloo’ in the name. It turns out I lack the ability to read an article on Wikipedia. There was a very carefully placed period in that sentence which, as our listeners probably know, delineates one sentence from another. So, no: the first Breakin’ was just called Breakin’. It was not Breakin’: Electric Boogaloo. I just have no ability to read anything on Wikipedia, apparently.



Jesse: I still feel like this is a missed opportunity for the first one in the franchise to be Breakin’: Electric Boogalone.



Pete: [laughs]. Almost as bad as Electric Boogalee, but—



Jesse: It's up there.



Pete: —that's for another podcast. Anyway, we are not talking today about breakdancing movies from the 1980s; we're actually talking about a bit of a change from our normal conversation, not necessarily Amazon-specific technologies, but fostering change within an organization, and some of the worst ways that we have seen change implemented in an organization. Fostering change is important in any organization in general—and maybe we're a little biased; we spend so much of our time dealing with cost savings and cost optimization—but it really is so much more important when you deal with an overarching cost optimization and management strategy within a company.



Jesse: Yeah, I feel like there's this massive disconnect at a lot of companies, where leadership has this really, really heavy incentive—or really, really heavy goal—to better understand and manage cloud costs, and the individual contributors or the underlying engineering teams just don't have the same focus. And that's not to say that they don't care about costs, so much as maybe they have other roadmap items that they're working on, or other tasks that have been prioritized ahead of cost optimization projects. So, there really seems to be this disconnect in thinking about cost optimization thoroughly throughout all levels of an organization. And it ultimately makes us think about how you go about making that change, because it seems like the best way to instill the importance of cloud cost optimization and management across a company is by instilling it in the company's culture. So, today, I really want to focus on: what are some of the ways that we can get the entire company to care about cost optimization and management the same way that leadership might? Or alternatively, if it's an individual contributor who cares, how can they get the rest of the company to care about these things, and vice versa?



Pete: Yeah, that's a really good point. And we deal with a whole swath of different companies and different people at those companies, where it's kind of amazing to see how some people just inherently really care about what's being spent. And it could be for various reasons. Maybe these are people who don't have any connection to the bill or paying the bill; I mean, myself, I am this person. I just hate waste. I hate waste in all parts of my life, but I really hate waste in my Amazon bill, because finding out that I didn't have to spend $10,000 last month on all of those API list requests on S3 due to that bug, it just—it cuts up my soul.



Jesse: And it's really rare to find people in any organization, whether it's a client that we're working with or an organization that you work in, that are super, super invested in that kind of cost optimization work. But when you find them—I was working with one recently at one of our clients who described themselves as a super nerd about cost optimization work. And that's perfect. That's what we want. We want somebody who nerds out over this stuff, and really passionately cares about, what's it going to cost for us to make changes?



Pete: Yeah. I mean, we are two people who have focused our careers on caring about how much people spend on their bill. We're cost nerds. It's fine. It's okay to say it.



Jesse: I accept this term. I accept.



Pete: [laughs]. So, before we get to some of the good ways that we've seen to get people to care about this stuff, we want to talk about some of the worst practices we've seen. And this is broader than just cost management. This really is: what are some of the worst ways we have seen a company try to effect change? Whether you're a startup that's trying to pivot to the next thing, make it to the next funding round; or maybe you're an enterprise and you're just trying to go digitally native, cloud-native, multi-cloud, or something like that. The technology is not your challenge. The technology is not the reason why you're not going to accomplish your goal; it's always going to be the people and getting them to care about it. So, what are some ways, Jesse, that you've found particularly grating?



Jesse: Yeah, if we're going to talk about incentivizing practices, I think the big one we need to talk about is gamifying the system, where leadership or management sets some kind of goal like, “We want all of our IT team's support tickets to be closed within 48 hours.” That's a great goal to set; that's a lovely SLA goal to work towards. But if you just set that goal blanketly for your team, they're going to gamify the system hard. They are going to end up closing tickets as soon as they send a response, rather than waiting to see whether the issue is actually resolved. I've experienced this multiple times, and it drives me absolutely batty, and it doesn't solve the underlying problem, which is faster and higher quality support for the customer.



Pete: I'm in this picture, and I don't like it.



Jesse: [laughs].



Pete: It's true, though. One of my first—we'll call it ‘real’—jobs ever in the tech space was actually support. I was support for a SaaS product, and one of the metrics we tracked was time to close; time to resolution. And there were no incentives on this; I just was really competitive. And I would send a response and I'd close the ticket. And for people who have worked in support before, you'll know that people react to a ticket being closed differently.



If you've ever opened a ticket with Amazon, for example: when they respond to you, they close the ticket, and you're like, “Whoa, whoa, hold on a second.” I think they've gotten better in more recent times, where they'll leave it open for a predetermined period of time, and then close it automatically to, obviously, hit their stats. But the time to close was always very questionable to me, and applying some sort of financial incentive around that—I mean, you're just going to bring out the worst in people.



Jesse: And I would much rather that ticket close after a certain amount of inactivity. I would rather get the passive-aggressive automated email saying, “Hey, we haven't seen a response from you. Do you still have this issue? Do you still want us to work on this? Can you update this ticket?” versus the, “I sent you one response and now I'm going to close this ticket. Thank you for playing. Goodbye.”



Pete: Yeah, exactly. I think there are real reasons to close a ticket out. I think for most folks, you've got to think about: what is the metric that matters to you? In many cases—at a previous company I had worked at—instead of trying to aim for tickets being closed in a certain period of time, it was time to first response. It was how quickly we were able to respond to that client, not necessarily get to resolution, because software support is way too complex to really nail down how quickly you can resolve an issue.



You could run into something that might take actual software engineering work to happen in the background. You might have a code change that has to go out, and maybe that code change has to go through your scrum cycle. And there's two more weeks, plus some QA time, and whatever. I mean, it's so hard to balance that out. So, like most things—and I think you'll find this in everything we talk about today—you have to understand what you're trying to incentivize for, not just apply random incentives or gamify the system.



Jesse: Yeah, and I think it's also important to note that it's not just about what you're trying to incentivize for, but how best to incentivize—and we'll talk about some of the better ways to incentivize in a minute—because there are positive and negative ways to reinforce your goals. Positive reinforcement, generally speaking, is going to be much more proactive: rewarding somebody who does the right thing. Negative reinforcement is more about shaming or punishing somebody who does the wrong thing. Nobody wants to be punished. Nobody wants to be called out on the carpet for something they did, whether it was intentional or accidental. And so it's a lot harder for organizations to get their employees to do what they want—to get their employees to care about cost optimization and these other metrics—if all they're doing is punishing them for not hitting the metric or not achieving the goal.



Pete: I will share one story that I don't think is negative reinforcement, but I guess I'll let our listeners, and you, Jesse, be the judge of me on this one. At a previous company, I created a tool that allowed you to connect to different servers within our environment, because at some point you're probably going to have to log into a server—even if it's in the cloud—and look at a log file or debug something. You're just always going to be there. And we made a change to how you would connect; functionally, this was the difference between dashes and underscores in how you connected to a thing. So, I was deprecating a certain way of connecting, and I used a helpful motivator: Clippy. Clippy is the mascot, I guess, of Microsoft Office.



Jesse: Yeah.



Pete: And so what would happen is, if you typed in this command incorrectly using, like, a deprecated command, Clippy would pop up and say, “Hey, did you actually mean to type this instead?” And then it would pause for a second, correct your mistake, and then send off the command.



Jesse: Oh, my God.



Pete: And I didn't make it wait long enough to be super annoying. It was just more, “Hey, just a reminder: you should stop using the tool this way.” And it was the best way I could think of to let a whole wide swath of people know. But then I took it one step further—and I had truly honest intentions with this one—I had it report a StatsD metric to a Grafana dashboard every time you did that, with your username associated with it. So, I ended up with—



Jesse: [laughs]. On no.



Pete: —a dashboard that showed who was doing it wrong. Now, in my defense, I actually used that dashboard to go to those people and just say, “Oh, hey, is there something about how you're using this tool that keeps you running into this? How can I help you? How can I make my software better?” But when someone found this dashboard, they actually brought it up in one of our on-call meetings and thought it was a lot more negative than it was ever intended to be. So, if you do create a dashboard, add some context to it. Make sure that people know its purpose. I really did not think it was that bad.
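(A loose reconstruction of the idea, not Pete's actual tool: a Python wrapper that nags on the deprecated form, counts offenders in StatsD, then runs the real command. It assumes the statsd package from PyPI, a local StatsD agent feeding Grafana, and a hypothetical connect-tool binary.)

#!/usr/bin/env python3
import getpass
import subprocess
import sys
import time

import statsd

stats = statsd.StatsClient("localhost", 8125)

def main():
    args = sys.argv[1:]
    if args and "_" in args[0]:  # underscore-style target names are deprecated
        fixed = args[0].replace("_", "-")
        print(f"Hey! Did you actually mean to type '{fixed}'? Underscores are deprecated.")
        # Count who is doing it wrong, per username, for the Grafana dashboard.
        stats.incr(f"deprecated_connect.{getpass.getuser()}")
        time.sleep(1)  # brief pause: noticeable, not infuriating
        args[0] = fixed
    sys.exit(subprocess.call(["connect-tool"] + args))  # hypothetical real tool

if __name__ == "__main__":
    main()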



Jesse: The intention was definitely there. The intention was so so good. Sadly, it was just taken out of context.



Pete: [laughs]. Well, then, of course, because of our hilarious—or so we thought were hilarious—jokes internally, we then started using Clippy for a bunch of different things. And anytime we deprecated something, Clippy came back again, and—



Jesse: Oh no.



Pete: We made fun of it; we made it a fun thing. But what you definitely don't want to do—and this is where that negative reinforcement comes in—is publicly shame engineers, employees, on a dashboard. That's one of the important parts of this: I never shared this dashboard publicly and said, “These five people are doing it wrong.” But I have seen scenarios where people have used those types of dashboards to rank their employees. You see it a lot in sales-type organizations; they are motivated far more with stick than carrot, I think.



Corey: This episode is sponsored in part by our good friends over at ChaosSearch, which is a fully managed log analytics platform that leverages your S3 buckets as a data store with no further data movement required. Whether you're looking to process multiple terabytes up to petabyte-scale of data a day or a few hundred gigabytes, this is still economical and worth looking into. You don't have to manage Elasticsearch yourself. If your ELK stack is falling over, take a look at using ChaosSearch for log analytics. Now, if you do a direct cost comparison, you're going to save 70 to 80 percent on the infrastructure costs, which does not include the actual expense of paying infrastructure people to mess around with running Elasticsearch themselves. You can take it from me or you can take it from many of their happy customers, but visit chaossearch.io today to learn more.



Jesse: Yeah, I think that's the important distinction to call out here: ultimately, your intention was good, and you were trying to use the dashboard to discover who might be making those mistakes and help them, versus somebody who might publicly share it and shame those people. Because if somebody in the company—whether it's leadership, a team, whoever—tries to shame others based on these kinds of leaderboard metrics, people are just going to lean into it harder and make a joke out of it entirely: how many times can I make this mistake to get the number one spot on the leaderboard? Even though it's a negative leaderboard, people will lean in harder if someone keeps making a joke out of it, or tries to make it something serious, when it was clearly meant to help people rather than shame anybody.



Pete: I mean, this is exactly why we don't pay people for lines of code, right? It's these arbitrary metrics that just don't have a lot of meaning in the real world. So, all right. We've gone through a couple, and we could fill this whole episode easily with all of the terrible ways and worst practices we've seen, or even worst practices that I've created for people that have worked for me.



But let's talk about the good things. What are some of the good ways that we have found to get people to actually care? And in this scenario, I'm going to specifically kick us off talking about, again, the cost optimization side of things. How do you get people to care about that? Because if you think about it, in a lot of ways, if someone were to come to me and say, “I need you to cut the spend on a particular service,” and I know that that could impact the availability, well, guess what?



If I'm on call, I'm not going to save the company money in a way that's going to cause me more pain, right? That's a really bad way of coming at it. And so, maybe I'm going to share my concerns about that. Hopefully, I work with a team that actually listens. But there has to be a balance. So, from the manager side—having previously been a manager managing a team—can you strike a balance between the carrot and the stick?



Now, one of the things that I had done with a good amount of success was to add more of the human aspect to cost savings. It was more face-to-face time with people—again, back when we could be face-to-face, which feels like a lifetime ago, pre-COVID. It was really trying to connect with the engineers at a personal level about what they were building and how they were using Amazon, to understand what they were trying to accomplish. So, maybe I would go in and say, “Wow, I'm looking at a series of these C5 extra-large instances, and their CPU is pretty much idle.” I can go to that engineer—based on some tags that we have, so there's an owner, maybe—and talk to them and say, “Hey, based on this workload, I actually think we can move over to T-class instances. What do you know about this service that I don't?” Now granted, maybe every once in a while, I might be like, “Yeah, be a real shame if anything happened to those C5 extra-larges there.”
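(Here is that “go find the idle C5s and their owners” sweep, sketched with boto3. The two-week window, the 5 percent threshold, and the “owner” tag key are all assumptions about your environment.)

from datetime import datetime, timedelta

import boto3

ec2 = boto3.client("ec2")
cw = boto3.client("cloudwatch")

reservations = ec2.describe_instances(
    Filters=[{"Name": "instance-type", "Values": ["c5.xlarge"]}]
)["Reservations"]

for res in reservations:
    for inst in res["Instances"]:
        # Average daily CPU over the last two weeks.
        datapoints = cw.get_metric_statistics(
            Namespace="AWS/EC2",
            MetricName="CPUUtilization",
            Dimensions=[{"Name": "InstanceId", "Value": inst["InstanceId"]}],
            StartTime=datetime.utcnow() - timedelta(days=14),
            EndTime=datetime.utcnow(),
            Period=86400,
            Statistics=["Average"],
        )["Datapoints"]
        avg = sum(d["Average"] for d in datapoints) / max(len(datapoints), 1)
        if avg < 5:  # basically idle; maybe a T-class instance fits better
            tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
            print(inst["InstanceId"], f"{avg:.1f}% CPU", "owner:", tags.get("owner", "?"))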



Jesse: [laughs].



Pete: But, you know, it was trying to be a little bit more personal and do that. And because of my love of saving money, I developed a nickname at the company called Captain COGS.



Jesse: Oh my God.



Pete: COGS is short for ‘cost of goods sold,’ because that was the metric we cared about internally at the business. We were a non-profitable startup, so that's the kind of financial metric people care about. But what was interesting is that by sharing that that's the thing we cared about, and the different ways engineers could help improve that number, people actually did start to care about it.



Jesse: Yeah, I think that's a really important point because what you're fundamentally getting at there is building this culture of trust and empowerment. You are trusting that the person who spun up the C5 class instances knew what they were doing when they deployed them, and you're asking them to share their context, asking, “Hey, do you have more information, more context about this than I do?” And in a lot of cases, they'll come back to you and say, “Oh, yeah, this was for a business requirement that we had to do X, Y, and Z.” Or maybe, “This was the only thing available or only thing powerful enough at the time,” or maybe, “The workload was higher at the time when they deployed the C5 class instances.” And so now that the workload has slowed down, they can move to something cheaper and better for the workload.



But ultimately, you're trusting the other person and you're empowering them to make these decisions. And I think that's honestly what this is all about. It's honestly about sharing what needs to happen with everybody. It is bringing the work individually to the people who are actually doing the work. It is sharing those goals, sharing all that information and those details with the people who are doing the work, and creates this culture of psychological safety to make mistakes and own up to them.



It's the space where you can ask those clear questions of, “Hey, did you mean to spin up that i3 instance, or was that an accident? Do we actually need all this storage in io1 EBS volumes, or is there something else that we can use instead?” And you're ultimately empowering them: empowering them to make informed decisions in the company's best interest. You're empowering them to participate in and shape the practices and processes that allow them to be mindful of cost during every part of the engineering process: feature development, forecasting, architectural decisions, all of it.



Pete: The important part of that, too, I think, is that context about what these costs are. You know, these are all movable levers. You can cut the cost on your Amazon usage to zero. You can just turn everything off, right?



Now, will your customers be happy? Probably not. But you can move these levers; they're all changeable, within the bounds of the business and the context of what needs to be done. One of the really big success points that I had was working closely with the product teams and the engineering teams to break down the cost of each individual feature.



So, if I had 10 features within a product, I could break them out and say, you know, feature one represents 30 percent of our total bill. And I could work with product to give them that insight, because honestly, the most amazing thing would happen. The product teams would say, “Wait a second. That's our least-used feature.” It almost always happened: the one that cost the most was the least used.



Jesse: Yeah.



Pete: And while I know that none of these changes are going to happen right away, just by dropping that little nugget to a product person, it will start to fester in their mind. And eventually, as that company matured, the cost of specific features started to show up in product planning sessions. When they would decide to refactor or change different features, how much of the total spend that feature represented would pop up. And it was that kind of democratization of that context across the business that enabled it.
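(If your resources carry a feature tag, Cost Explorer can produce that per-feature breakdown for you. A sketch; the “feature” tag key is an assumption about your tagging scheme.)

import boto3

ce = boto3.client("ce")

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2020-10-01", "End": "2020-11-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "feature"}],  # assumed tag key
)

for group in resp["ResultsByTime"][0]["Groups"]:
    name = group["Keys"][0]  # comes back as, e.g., "feature$search"
    cost = float(group["Metrics"]["UnblendedCost"]["Amount"])
    print(f"{name}: ${cost:,.2f}")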



Jesse: It's data-driven decision making. It's giving the necessary data to all of the players involved so that they can make informed decisions about business goals and product releases and feature releases and optimization efforts, full well knowing what making those changes might ultimately cost to the company.



Pete: Yeah, I think the other item that I always think about here is that getting people to care about how much the stack they're about to deploy is going to cost is hard because of that missing context. I know there are plugins for tools like Terraform that'll tell you, “You're about to provision something that's $300 right now—” or, “$5,000 right now.” Well, is that a lot of money? Like, $5,000?



Like, yeah, that's a ton of money. But what if that represented a fraction of a fraction of a percent of your total bill, right? What if that was such a small rounding error? Or what if that represented 50 percent of your bill? It's the context that matters. And that's what's missing for a lot of those tools.



Even when trying to price out a brand-new service, trying to cost-model something out: on paper, it could look wildly expensive, but in relation to, maybe, your engineering efforts—well sure, this is going to cost us $10,000 a year, but we're going to get back, like, a whole engineering resource who doesn't have to deal with the broken thing anymore, right? Those trade-offs and those decisions have a lot more impact when you grab that data and when you really ask those questions.



Jesse: Absolutely. And I think that's so critical to be able to look at all of the pros and cons of any business decision, both in terms of the actual costs of building something that AWS or your cloud provider might charge you, and then looking at the hidden costs in terms of engineering effort; in terms of other resources, whether it's a third party resource like another monitoring or observability tool, or other infrastructure resources. It's important to look at all those resources together in order to make your decision.



Pete: Well, I think we could fill plenty of additional Whiteboard Confessional podcasts with more of these worst-of-the-worst and best-of-the-best ways of fostering change in your organization. Shoot us a message on Twitter @lastweekinaws; we'd love to hear: what have you seen work really well? What have you seen that has not worked as well? All right, Jesse, again, thank you for joining me; otherwise, it would just be myself, talking to myself, just me, myself, and I.



Jesse: If it was just you talking to yourself, we never would have gotten off the Breakin’ rant that started last week, and we would still be here talking about breakdancing movies.



Pete: It'd be 30 minutes in and I would be misquoting Wikipedia articles.



Jesse: I'm happy to help—well, actually you—anytime.



Pete: [laughs]. Thanks again, Jesse. So, if you've enjoyed this podcast, please go to lastweekinaws.com/review and give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review and give it a five-star rating on your podcast platform of choice, and tell us some of the worst ways that you have seen change done in an organization. Thanks again.



Announcer: This has been a HumblePod production. Stay humble.

Fri, 23 Oct 2020 03:00:00 -0700
Reader Mailbag: Potpourri (AMB Extras)

Want to give your ears a break and read this as an article? You’re looking for this link: https://www.lastweekinaws.com/blog/reader-mailbag-potpourri



Sponsors



Never miss an episode



Help the show



What's Corey up to?

Wed, 21 Oct 2020 03:00:00 -0700
Don't Interrupt Me... Last Week In (A)s I (W)as(S)aying
AWS Morning Brief for the week of October 19, 2020 with guest host Brianna McCullough.
Mon, 19 Oct 2020 03:00:00 -0700
AWS Cost Anomaly Detection 2: Electric Boogaloo

About Corey Quinn

Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.


Transcript
Corey: This episode is sponsored in part by Catchpoint. Look, 80 percent of performance and availability issues don’t occur within your application code in your data center itself. They occur well outside those boundaries, so it’s difficult to understand what’s actually happening. What Catchpoint does is make it easier for enterprises to detect, identify, and of course, validate how reachable their application is, and of course, how happy their users are. It helps you get visibility into reachability, availability, performance, reliability, and of course, absorbency, because we’ll throw that one in, too. And it’s used by a bunch of interesting companies you may have heard of, like, you know, Google, Verizon, Oracle—but don’t hold that against them—and many more. To learn more, visit www.catchpoint.com, and tell them Corey sent you; wait for the wince.


Pete: Hello, and welcome again to the AWS Morning Brief: Whiteboard Confessional. Corey is still enjoying some wonderful family time with his new addition, so you're still stuck with me, Pete Cheslock. But I am not alone. I have been joined yet again, with my colleague, Jesse DeRose. Welcome back, Jesse.


Jesse: Thank you for having me. I will continue to be here until Corey kicks me back off the podcast whenever he returns and figures out that I've locked him out of his office.


Pete: We'll just change all the passwords and that'll just solve the problem.


Jesse: Perfect.


Pete: What we're talking about today is the “AWS Cost Anomaly Detection, Part Two: Electric Boogaloo.”


Jesse: Ohh, Electric Boogaloo. I like that. Remind me what that's from. I feel like I've heard that before.


Pete: Okay, so I actually went to go look it up because all I remembered was that there was, like, a movie from the past, “Something Two: Electric Boogaloo,” and I dove into the internet—also known as Wikipedia—and I found it was a movie called Breakin’ 2: Electric Boogaloo, which is a 1984 film. And it says it's a sequel to the 1984 breakdancing film Breakin’: Electric Boogaloo, which I thought was kind of interesting because I always thought that joke, ‘Electric Boogaloo,’ was related to the part two of something, but it turns out it's not. It actually can be used for both part one and part two.


Jesse: I feel like I'm a little disappointed, but now I also have a breakdancing movie from the ’80s to go watch after this podcast.


Pete: Absolutely. If this does not get added to your Netflix list, I just—I don't even want to know you anymore.


Jesse: [laughs].


Pete: What's interesting, though, is that there was a sequel called Rappin’, which says, “Also known as Breakdance 3: Electric Boogalee.”


Jesse: Okay, now I just feel like they're grasping at straws.


Pete: I wonder if that was also a 1984 film. Like, if all of these came out in the same year. I haven't looked that deep yet.


Jesse: I feel like that's a marketing ploy, that somebody literally just sat down and wrote all of these together at once, and then started making the films after the fact.


Pete: Exactly. One last point here, because it's too good not to mention: it basically says that all these movies, or at least the later ones, had unconnected plots and different lead characters; only Ice-T featured in all three films. Which then got me thinking—wait a second, Ice-T was in this movie? Why have I not watched this movie?


Jesse: Yeah. This sounds like an immediate cult classic. I need to go watch this immediately after this podcast; you need to go watch this.


Pete: Exactly. So, anyway, that's the short diversion from our, “AWS Cost Anomaly Detection, Part Two” discussion. So, what did we do last time? Why is this a part two? Hopefully, you have listened to our part one. It was, I thought, quite amazing—but I'm a little bit biased on that one—where we talked about a new service that was very recently announced at Amazon called AWS Cost Anomaly Detection.


And this is a free—free service, which is pretty rare in the Amazon ecosystem—that can help you identify anomalies in your spend. So, we got a bit of a preview from some of the Amazon account product owners for this Cost Anomaly Detection, and then we got a chance to just dive into it when it turned on a few weeks ago. And it was pretty basic.


It's a basic beta service—they actually list it as beta—and the idea behind it is that it will let you know when you have anomalies in your cost data, primarily increases in your cost data. I remember us specifically talking about how it's hard to identify decreases in spend as an anomaly. So, right now it only supports increases. So, a few weeks ago, we went into our Duckbill production accounts, turned it on, and we've just been waiting for anomalies so that we could do this.


Jesse: I also think it's worth noting that I'm actually kind of okay with it being basic for now because if you look at almost any AWS service that exists right now, I would say none of them are basic. So, this is a good place to start and gives AWS opportunities to make it better from here without making it convoluted or difficult to set up in the first place.


Pete: A basic Amazon service, much like myself.


Jesse: [laughs].


Pete: So, guess what? We found anomalies. Well, we didn't find them. The ML backing Cost Anomaly Detection found some anomalies. And now that we actually have some real data, real things happened, and we dove into some of those anomalies, interestingly enough. So, that's what we're here to talk about today.


Jesse: It's also probably worth noting that we changed our setup a few times over the course of kicking the tires on this service, and unfortunately, we weren't able to thoroughly test all of the different features that we wanted to test before this recording. So, we do still have some follow up items that we'll talk about at the end of this session. But we did get a chance to look at the majority of options and features of this service, and we'll talk about those today.


Pete: So, if you remember—or maybe you don't because you didn't listen to the last episode we did—we configured a monitor, as it's called, that will analyze your account based on a few different criteria. The main one, I think, just looks at the different AWS services across your account. And you can have it go look at specific accounts or specific cost allocation tags; there's a whole set of options you can use for these alerts. And the only real configuration choice that you have to make is an alert threshold. And this was something that took us a little bit to understand, and I think we both really understand it a lot better now. And we made a change, right? Like, what we thought it was, wasn't totally what it was.


Jesse: Yeah, initially, this was a little bit confusing for me, and it took us a while—it at least took me a while—to wrap my brain around the difference between the anomaly itself and the alert threshold. Effectively, the anomaly can be any dollar amount: any amount of spend over the baseline spend that's expected for that particular service or that particular monitor you've enabled. The alert threshold, though, is just the dollar amount an anomaly has to exceed before you're actually notified. So, in our case, when we first enabled the service, we set the alert threshold at $10. But all the anomalies that we saw were much lower than that. They were all about $1 apiece. So, we never got alerted to those anomalies, even though we did log into the console and see them.


Pete: Yeah. I think that's really the key thing: the alert threshold is truly that. It is: when an anomaly is identified, at what spend level do you want to receive an alert? And the alerts that it generates are, kind of, real-time, where you can have it notify an SNS topic. Our configuration had that SNS topic go to AWS Chatbot, which will drop a message into our Slack.


We reduced that, as Jesse said, down to $1 because I still kind of want to see what it looks like when it shows up in Slack. So, hopefully, we'll see that in a few weeks, or at just some point in the future. But then you can also have it send these summaries daily or even weekly. I'm not sure if there was a monthly option. Maybe, Jesse, you remember that one. But—


Jesse: Yeah, it looked like there was just daily and weekly for now.
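A minimal sketch of the subscription setup described above, using the Cost Explorer API via boto3. The subscription name, monitor ARN, topic ARN, and account ID are hypothetical placeholders; at the time of this episode, SNS subscribers pair with the IMMEDIATE frequency, while DAILY and WEEKLY summaries go out over email.

    import boto3

    ce = boto3.client("ce")  # the anomaly detection APIs live under Cost Explorer

    # Immediate alerts for anomalies whose impact exceeds $1, sent to an SNS
    # topic that AWS Chatbot can relay into Slack.
    ce.create_anomaly_subscription(
        AnomalySubscription={
            "SubscriptionName": "duckbill-slack-realtime",
            "MonitorArnList": [
                "arn:aws:ce::123456789012:anomalymonitor/11111111-2222-3333-4444-555555555555"
            ],
            "Subscribers": [
                {
                    "Type": "SNS",
                    "Address": "arn:aws:sns:us-east-1:123456789012:cost-anomaly-alerts",
                }
            ],
            "Threshold": 1.0,  # the alert threshold, in dollars; not the anomaly definition
            "Frequency": "IMMEDIATE",
        }
    )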


Pete: So, this came back to one of our original gripes, which was it didn't seem like you could create multiple anomaly alerts. So, for example, I might want to have the real-time stuff going into Slack just by default, but then maybe there's someone in my finance department, or my VP of engineering, who's not in that Slack channel and doesn't really want the noise. They want to get the weekly summary. It didn't look like there was a way to do that, and I think that's where we ran into an issue last time: we specifically got an alert—or rather, an error—when setting that up.


Jesse: Yeah, and it's also worth noting that the error itself was rather vague. It said that we couldn't enable this alert but didn't tell us why, or how, or what part of the walkthrough was erroring out. And I can see that this would be really beneficial to allow individual contributors to see their spend alerts repeatedly in Slack, whereas somebody that's much higher up doesn't need that level of noise. So, they just want the report at the end of the day or the end of the week to know what's going on with their teams.


Corey: This episode is sponsored in part by our good friends over at ChaosSearch, which is a fully managed log analytics platform that leverages your S3 buckets as a data store with no further data movement required. Whether you're looking to process multiple terabytes to a petabyte scale of data a day or just a few hundred gigabytes, this is still economical and worth looking into. You don't have to manage Elasticsearch yourself. If your ELK stack is falling over, take a look at using ChaosSearch for log analytics. Now, if you do a direct cost comparison, you're going to save 70 to 80 percent on the infrastructure costs, which does not include the actual expense of paying infrastructure people to mess around with running Elasticsearch themselves. You can take it from me or you can take it from many of their happy customers, but visit chaossearch.io today to learn more.


Pete: There is definitely a use case here—hopefully in the future they enable this—where, for some organizations, if I can set up these anomaly alerts with various cost allocation tags, and I have different product teams, or product owners, or business units or whatever, and I can notify those teams in some sort of reasonable way, giving them a heads up, like, “Hey, here's some anomalies,” that could be super powerful. So, again, right now, it doesn't look like you can, but again, it's beta, and also, it's free. So, can we really—I mean, yes, we can, of course, complain that much about it. But still. [laughs].


Jesse: Similar to what we said last week, I am thrilled that this service exists, even if there are things that we want from it. All of the questions that we have, all the content from the previous session and this session—these are all wishlist items; these are all ways that this service can improve, but they're in no way critiques of the existing service itself. There's definitely lots of room for improvement.


Pete: Yeah. If you would like to hear more of our critiques of a service, just stay tuned for our future QuickSight product deep dive because—


Jesse: Oh God. Don't even get me started. My eyes are already twitching.


Pete: [laughs]. Don't worry, QuickSight team, we still love you. All right, so we found an anomaly, and the anomaly pointed us towards a root cause—and that is a bit of a charged term now in the technical ops communities, right?


Jesse: Yeah, there's a lot of pushback against the phrase ‘root cause’ in the industry because in most cases—I’m going to butcher this really poorly, and there's many other sources that talk about this more clearly, but in most cases, there's not a single root cause. There's multiple contributing factors to any event. So, using this phrase is kind of frustrating for me, and I really wish that they had talked about these potential causes for the anomalous spend as such, as ‘potential causes’ rather than a ‘root cause.’


Pete: And this feels like—I am not an English major, and I have not studied this area in-depth, but it kind of feels like the term ‘contributing factor’ would work here: you could call these findings contributing factors to the cost anomaly.


Jesse: But again, it's also worth pointing out that we appreciate that this service highlights what it suspects are the potential contributing factors. So, rather than just saying, “Hey, your spend for a particular service or particular monitor is up this much money,” it's actually pointing you to what is potentially causing that. So, there's definitely good coming out of this feature; we just wish that it was renamed, which should be a simple request.


Pete: We say that but, of course, having no idea what it takes to rename—[laughs]. But as you said, it identifies the anomaly in some way. And for us, we actually saw an anomaly related to S3. And this was related to a lot of other anomalies we found that were centered around our Athena and QuickSight usage, which is why our QuickSight wounds are still a little fresh. It identified the S3 anomaly for us a couple of weeks ago, and when you go and dive into that anomaly, it will provide you a link that takes you directly to Cost Explorer.


And I really like this feature of taking you to Cost Explorer, so that you can see everything broken out, with every filter already applied. It even includes something called ‘usage type,’ which, if you're a Cost Explorer newbie, is, kind of, the billing code for the specific usage. So, in our scenario, it was a requests tier two usage type—a class of S3 requests that covers GETs and similar read requests. And so we saw a much higher number of those tier two requests to S3, which helped us identify a little bit more about what was causing this.
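If you want to reproduce that Cost Explorer view programmatically, here's a rough sketch using boto3's Cost Explorer client; the date range is a placeholder, and 'Amazon Simple Storage Service' is the billing name for S3.

    import boto3

    ce = boto3.client("ce")

    # Break S3 spend out by usage type to spot a jump in Requests-Tier2.
    response = ce.get_cost_and_usage(
        TimePeriod={"Start": "2020-10-01", "End": "2020-10-15"},
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
        Filter={
            "Dimensions": {
                "Key": "SERVICE",
                "Values": ["Amazon Simple Storage Service"],
            }
        },
        GroupBy=[{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
    )
    for day in response["ResultsByTime"]:
        for group in day["Groups"]:
            print(
                day["TimePeriod"]["Start"],
                group["Keys"][0],
                group["Metrics"]["UnblendedCost"]["Amount"],
            )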


Jesse: And it's also worth noting that when you view any detected anomaly, the service also gives you the opportunity to train the machine learning model: there's a little button up at the top that says, “Submit assessment.” And it asks, basically, “Hey, did you find this detected anomaly to be helpful?” And you can say, “Yes, it was an accurate anomaly,” or, “No, it was a false positive,” or, “Yes, it was an anomaly, but we expected it.” Which will ultimately help the model better understand what your spend looks like over time, where to expect anomalies, and how better to alert you, as an AWS customer, about your spend in the future.
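That assessment button corresponds to the ProvideAnomalyFeedback API; a minimal sketch with boto3, where the anomaly ID is a hypothetical placeholder:

    import boto3

    ce = boto3.client("ce")

    # Tell the model this anomaly was real but expected, e.g. a planned load test.
    # The other accepted values are "YES" (accurate) and "NO" (false positive).
    ce.provide_anomaly_feedback(
        AnomalyId="11111111-2222-3333-4444-555555555555",
        Feedback="PLANNED_ACTIVITY",
    )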


Pete: Honestly, I think this was a missed opportunity by the Cost Anomaly Detection team to not create an Amazon version of Clippy, where they could just have something pop up that's like, “I see you're trying to report this assessment of this anomaly.” I don't know what the character would be; I'm not very creative, but I think this was definitely a missed opportunity. So, I think they should grab some of those creative minds at Amazon and toss them at this problem.


Jesse: [laughs]. I would love to see an AWS Clippy. I'm going to start a hashtag on Twitter for AWS Clippy.


Pete: We'll get to work on that. I think another thing we found that is definitely an area for improvement, especially if you have a lot of Amazon accounts, is that when it does report the anomaly and takes you one level down—where you can view it in Cost Explorer or report on it—it'll list the region, the service, and the account, but it lists the account by account number—


Jesse: Yeah.


Pete: —when we've got—what, like, I think we have a handful of accounts, maybe less than 10, but how does that work when you've got hundreds or more?


Jesse: Yeah, I feel like this is a missed opportunity. I don't expect anybody to remember the account number for a single linked account, let alone for tens or hundreds of accounts. And I really wish that there was a way the service could tie the account number directly to the name on the account, or maybe the meta-name on the account, depending on how the account is set up. Something that gives the user who's looking at this information a little bit clearer data to dive into, and a little bit clearer opportunity to know, “Okay, I'm looking at this particular service in this particular account.” And rather than saying, “Oh, the account number is XYZ,” it would say it is the production account, or the development account, or the security account, really clearly off the bat.


Pete: Look, I'm not a network administrator from the early aughts. I don't have numbers memorized, like IP addresses and such. So—


Jesse: Oh, don't even get me started.


Pete: I still have some of those IP addresses memorized. It's sad. I mean, meanwhile, I can't remember my kids’ birthdays, but the IP address of the DNS server I used back in 2002? Still burned in there, right? So, yeah, I think helpful, friendly names could exist here.


And these connections exist, right? Your billing console will have that context. If your bill just gave you a bunch of account IDs and your total spend, I think many companies would have a hard time figuring out which one was the account ID and which one was the amount of spend because the numbers are so big in both cases. But adding a little bit more context, I think would be helpful.
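Until the service does this for you, one workaround is to build the account-ID-to-name lookup yourself. A sketch, assuming you can call AWS Organizations from the management account (or a delegated administrator), with a made-up account ID:

    import boto3

    org = boto3.client("organizations")

    # Map linked account IDs to their human-friendly names.
    account_names = {}
    for page in org.get_paginator("list_accounts").paginate():
        for account in page["Accounts"]:
            account_names[account["Id"]] = account["Name"]

    # e.g. turn the bare ID from an anomaly record into something readable
    print(account_names.get("123456789012", "unknown account"))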


Jesse: I'm also really curious what sets the severity level for each anomaly. So, when you log into the anomaly dashboard, you see the table of recent anomalies. And each row—each anomaly—has a severity level associated with it, from low to medium to high. I'm really curious what sets that severity level. We couldn't find this in the existing documentation or in the walkthrough wizard. I'm not sure if it's something that I'm missing in the documentation, or if it just isn't clearly documented yet.


Pete: Yeah, it’s a good point. We can make some guesses. Are higher spend numbers ranked more severe than lower ones? Are they spends that are closer to our alert threshold? It's really hard for us to say, but we'd definitely love a little bit more insight in the documentation there.


The other thing, too, that we don't really have a good answer for—we're kind of waiting for more anomalies—is how quickly the alerts show up when anomalous spend happens. As everyone probably knows, your spend data within Amazon is laggy. It can take hours—or even days—for some charges to post through. So, how long after anomalous spend occurs does the ML pick it up? That's something where I think, as more people use it, and as we use it some more, we can see some examples.


And hopefully, with our threshold being set so low for alerting, when we see a message dropped into our Slack channel, we can go analyze it right away. I think it'll be pretty cool to see how quickly that happens. If for some reason we are messing around with Athena, and S3, and QuickSight again, are we creating new anomalies? And how quickly from the time that I might be messing around in those services to the alert being posted? If that's measured in hours, like that could be pretty interesting.
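One way to eyeball that lag yourself is to poll the GetAnomalies API and compare each anomaly's start date against when it first shows up; a rough sketch:

    import boto3
    from datetime import date, timedelta

    ce = boto3.client("ce")

    # Pull everything detected over the past week and print when each anomaly
    # began and its total dollar impact.
    response = ce.get_anomalies(
        DateInterval={"StartDate": str(date.today() - timedelta(days=7))},
        MaxResults=25,
    )
    for anomaly in response["Anomalies"]:
        print(anomaly["AnomalyStartDate"], anomaly["Impact"]["TotalImpact"])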


Jesse: And this goes back to one of my previous comments, which is we still have lots of digging to do on this service. We've got multiple products leveraging AWS, so we definitely want to enable a monitor for each of our product tags so we can get a clearer idea of spend by product. And we want to dig into some of these anomalies that we already have seen; we want to dig into them further, we want to better understand where they're coming from and what's causing them, and we want to make sure that the Slack integration, or the SNS integration, is working as expected, that we can receive these alerts clearly and effectively, and just really continue testing all of what we're seeing so far.


Pete: So, Jesse, are you saying that there might be a, “AWS Cost Anomaly Detection, Part Three: Electric Boogalee?”


Jesse: Oh, my God, don't even get me started. I'm sorry, folks. I'm just going to put in my resignation now.


Pete: All right. Well, thanks again, Jesse, for taking us through AWS Cost Anomaly Detection, and all the fun stuff we found. Really appreciate that.


If you enjoyed this podcast, please go to lastweekinaws.com/review and give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review, give it a five-star rating on your podcast platform of choice and tell us, how many anomalies did you find? Thanks again. Bye-bye.


Announcer: This has been a HumblePod production. Stay humble.

Fri, 16 Oct 2020 03:00:00 -0700
Reader Mailbag: Accounts (AMB Extras)


Links Mentioned




Sponsors




Never miss an episode




Help the show




What's Corey up to?


Wed, 14 Oct 2020 03:00:00 -0700
Snark Interrupted
AWS Morning Brief for the week of October 12, 2020 with guest host Veliswa Boya.
Mon, 12 Oct 2020 03:00:00 -0700
The Cloud is Not Just Another Data Center (Whiteboard Confessional)

About Corey Quinn

Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Links


Transcript
Corey: This episode is sponsored in part by Catchpoint. Look, 80 percent of performance and availability issues don’t occur within your application code or your data center itself. They occur well outside those boundaries, so it’s difficult to understand what’s actually happening. What Catchpoint does is make it easier for enterprises to detect, identify, and of course, validate how reachable their application is, and of course, how happy their users are. It helps you get visibility into reachability, availability, performance, reliability, and of course, absorbency, because we’ll throw that one in, too. And it’s used by a bunch of interesting companies you may have heard of, like, you know, Google, Verizon, Oracle—but don’t hold that against them—and many more. To learn more, visit www.catchpoint.com, and tell them Corey sent you; wait for the wince.


Pete: Hello, and welcome to the AWS Morning Brief: Whiteboard Confessional. I am again Pete Cheslock, not Corey Quinn. He is still out, so you're stuck with me for the time being. But not just me because I am pleased to have Jesse DeRose join me again today. Welcome back, Jesse.


Jesse: Thanks again for having me.


Pete: So, we are taking this podcast down a slightly different approach. If you've listened to the last few that Jesse and I have run while Corey has been gone, we've been focusing on deep-diving into some interesting—in some cases, new—Amazon services. But today, we're actually not talking about any specific Amazon service. We're talking about another topic we're both very passionate about. And it's something we see a lot with our clients at The Duckbill Group: people treating the Cloud like a data center.


And what we know is that the Cloud—Amazon—these are not just data centers, and if you treat them like one, you're not actually going to save any money, and you're not going to get any of the benefits out of it. And so there's an impact that companies will face when they choose between something like cloud-native versus cloud-agnostic or a hybrid-cloud model as they adopt cloud services. So, let's start with a definition of each one. Jesse, can you help me out on this?


Jesse: Absolutely. So, a lot of companies today are cloud-native. They focus primarily on one of the major cloud providers when they initially start their business, and they leverage whatever cloud-native offerings are available within that provider rather than leveraging a data center. So, they pay for things like AWS Lambda, or Azure Functions, or whatever cloud offering Google's about to shut down next. Rather than investing in physical hardware and spinning up virtual machines, they focus specifically on the cloud-native offerings available to them within their cloud provider.


Whereas cloud-agnostic is usually leveraged by organizations that already use data centers, so they're harder pressed to immediately migrate to the Cloud: the ROI is murkier, and there's definitely sunk costs involved. So, in some cases, they focus on the cloud-agnostic model, where they leverage their own data centers and cloud providers equally, so that compute resources run virtual servers no matter where they are. Effectively, all they're looking for is some kind of compute resource to run all their virtual servers, whether that is in their own data center or one of the various cloud providers, and then their application runs on top of that in some form.


Last but not least, the hybrid-cloud model can take a lot of forms, but the one we see most often is clients moving from their physical data centers to cloud services. And effectively, this looks like continuing to run static workloads in physical data centers or running monolith infrastructure in data centers, and running new or ephemeral workloads in the Cloud. So, this often translates to: the old and busted stays where it is, and new development goes into the Cloud.


Pete: Yeah, we see this quite a bit, where a client will be running in their existing data centers, and they want all the benefits that the Cloud can give them, but maybe they don't want to truly go all-in on the Cloud. They don't want to adopt some of the PaaS services because of fear of lock-in. And we're definitely going to talk about vendor lock-in because I think that is a super-loaded term that gets used a lot. Hybrid-cloud, too, is an interesting one because some people think that this means running across multiple cloud providers, and that's just something we don't see a lot of. I don't think there are a lot of companies out there running true multi-cloud—I think that's the term you would really hear.


And the main reason I believe that not a lot of people are doing this—running a single application across multiple clouds—is that people don't talk about it at conferences. And at conferences, people talk about all the things that they do, when in reality, it's often wishful thinking. And yet no one is willing to talk about being multi-cloud in that, again, kind of, singular-application sense. So, one thing we do see across these three models at a high level—cloud-native, cloud-agnostic, hybrid-cloud—is that the spend is just dramatically different if you compare multiple companies across these different use cases. Jesse, what are some of the things that you've seen across these models that have impacted spend?


Jesse: I think first and foremost, it's really important to note that this is a hard decision to make from a business context because there's a lot of different players involved in the conversation. Engineering generally wants to move into the Cloud because that's what their engineers are familiar with. Whereas finance is familiar with an operating model that does not clearly fit the Cloud. Specifically, we're talking about CapEx versus OpEx: we're talking about capital expenditures versus operating expenditures. Finance comes from a mindset of capital expenditures, where they are writing off funds that are used to maintain, acquire, upgrade physical assets over time.


So, a lot of enterprise companies manage capital expenditure for all the physical hardware in their data centers. It's a very clear line item to say, “We bought this physical hardware; it's going to depreciate over time.” But moving into the Cloud, there is an operating expenditure model here instead, which focuses on ongoing costs for running a product because any cloud provider is going to charge you an on-demand price by default. You're rarely going to pay an upfront cost with any new service that you run in any cloud provider.


Pete: Yeah, I think that's a really good point: the model flips on its head, which is why it really trips up a lot of companies. In the old days—which for some companies really isn't that old—you have some servers that you purchased maybe three to five years ago. From an accounting standpoint, they're fully depreciated, which means they're not really costing anything; they probably have no book value. Which means they can sit idle, and it doesn't cost the business anything. But if you spin up EC2—and if we use EC2 as an example—running an EC2 instance that is not absolutely maxed out on its resources, whether it is CPU or memory, means you are wasting money.


Jesse: Absolutely.


Pete: And again, with the exception of T classes, and there's a lot of other interesting ways around it. But by running everything on EC2—which we see when companies first adopt the Cloud, or just due to architecture reasons—there's a lot of hidden cost in there. There's a lot of waste within EC2 that you avoid when you architect for ephemerality: when you can architect to use services like Lambda, Functions as a Service, or services like Fargate, where you can just run a container for a period of time. I think one of the bigger cost areas of EC2 comes down to operational overhead that no one ever thinks about; no one ever considers the people involved in operating and managing all of the complexity around your multiple EC2 servers.


Jesse: Absolutely.


Pete: And running your application on EC2, running a database on top of EC2, and then your application on top of that database—there are so many levels in there. And we see it a lot, too, as people start to, for some reason, deploy their own Kubernetes to a cloud provider and then deploy an application on top of it. They're just adding complexity on top. But there's actually another thing, too, that I know you dive into a lot, Jesse, and see with our clients: one of the hidden costs of EC2. What is that?


Jesse: It's data transfer. And it's absolutely phenomenal to see because, coming from a data center world, data transfer is completely free in most cases. The network traffic between physical servers in a single data center doesn't cost anything, and there may be some costs involved with bandwidth to and from a given data center, but for the most part that data transfer is free. Whereas, again, in the OpEx model with on-demand spend in the Cloud world, data transfer costs you, for lack of a better phrase.


Think about your Kafka workloads. Think about your Cassandra and MongoDB workloads. Think about any of your distributed managed services that are running in your ecosystem: those require lots of replication traffic in order to run effectively. And that traffic isn't free if you run the services on top of EC2 instances. You're going to be charged for every bit of traffic that runs between nodes across availability zones or across regions. So, you're paying for a lot of data transfer upfront for these services.
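To make that concrete, a back-of-the-envelope sketch. The write volume and replication factor below are made up, and the $0.01/GB charged on each side of a cross-AZ hop is typical for most regions, but check the rate for yours:

    # Hypothetical self-managed Cassandra cluster, replication factor 3,
    # spread across three availability zones.
    gb_written_per_day = 500       # assumed application write volume
    cross_az_replicas = 2          # each write is copied to two other AZs
    rate_per_gb = 0.01 + 0.01      # cross-AZ transfer is billed on both sides

    monthly_cost = gb_written_per_day * cross_az_replicas * rate_per_gb * 30
    print(f"~${monthly_cost:,.0f}/month just for replication traffic")  # ~$600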


Pete: Yeah, I remember talking with an Amazon account manager many years ago, and they said to me, “Oh, I can just look at your bill and tell you if you're running Cassandra or not,” because in deploying Cassandra, you're going to have to replicate your writes across multiple availability zones—I mean, if you care about your data—and there's a cost to replicate across availability zones within Amazon. And that's just something that people don't think about when they're running in their own data center. Sending data out to the internet, sure, has a cost. You know, the peering traffic, and things of that nature. But once you're inside a rack of servers, or even multiple data centers, if you own those connections, you can just send whatever data you want.


And on paper, when a lot of our clients look at the cost of Amazon managed services, they may look expensive. Amazon Elasticsearch Service, Amazon's DocumentDB, Amazon's Aurora, things like that: these may look expensive when you compare them to, “Well, I could just run it on EC2 myself.” But the thing you're missing—and it's not clear in a lot of ways—is the replication traffic. When Amazon replicates your data across multiple availability zones for durability reasons in those managed services, there's just no data transfer cost; it's essentially baked into what you're going to pay for storage. And so it's a hidden cost that people don't even think about.


Jesse: It's something that no engineering team that I have worked for, or talked to, in my career has thought about. They very much focus on the upfront numbers that are on paper on each cloud provider’s website, and they don't factor into their conversations: what is the overhead of data transfer? What is the overhead of engineering effort to manage this new infrastructure? And all of those things need to be thought about when discussing a new product or a new feature offering. It's really important to make sure that cost is part of that conversation, and that cost is not just the price on the website, but all the components of the architecture that are going to combine to give you your overall cost.


Pete: Exactly.


Corey: This episode is sponsored in part by our good friends over at ChaosSearch, which is a fully managed log analytics platform that leverages your S3 buckets as a data store with no further data movement required. Whether you're looking to process multiple terabytes to a petabyte scale of data a day or just a few hundred gigabytes, this is still economical and worth looking into. You don't have to manage Elasticsearch yourself. If your ELK stack is falling over, take a look at using ChaosSearch for log analytics. Now, if you do a direct cost comparison, you're going to save 70 to 80 percent on the infrastructure costs, which does not include the actual expense of paying infrastructure people to mess around with running Elasticsearch themselves. You can take it from me or you can take it from many of their happy customers, but visit chaossearch.io today to learn more.


Pete: Let's move on to one of my favorite things ever. I love to talk to people about this, and mostly just rant about it. It's vendor lock-in. It's that term that you hear all the time that usually drives an ill-conceived architectural decision. “Oh, we can't do that. We don't want to be locked into that vendor.” But it’s o—


Jesse: I hate to break it to you, but almost no matter where you go, you've got vendor lock-in.


Pete: You're locked into so many decisions that you have no control over. Let's just say you're on Amazon Web Services, and you think you’re vendor-locked-in to Amazon Web Services. But what does your application run on? Does it run on Cassandra? You're locked into Cassandra. Does it run Mongo? Well, you've got some vendor lock-in there.


You could say, “Oh, well, these are open source solutions. I can change at any time.” Okay. I'll come back to you in two years, and I'm going to look at those existing databases that are still running Postgres, Cassandra, Mongo, whatever. You're locked into those things, and it's not a big deal. Just don't let the vendor lock-in boogeyman scare you away from any sort of reasonable improvement in your infrastructure.


Jesse: And it's also worth noting that if you are already on a specific cloud provider, and talk about moving to another cloud provider because you're worried about lock-in, talk to your engineers first because in most cases, they already have skill sets for whatever cloud provider you're on, and if you move to another cloud provider, they may or may not stay with you. The learning curve may be astronomically high for them to move all of your infrastructure to a new vendor.


Pete: Yeah. That is one of the biggest points. You're really locked in by your ability to hire expertise for the specific cloud provider you're on. And if you have a lot of engineers who are experts in Amazon Web Services, and you go to them—like Jesse said—and say, “Yeah, we signed a deal with Azure and we're going to move there,” or, “We signed a deal with Oracle Cloud; we're going to move there,” my guess is, before you finish that sentence, half of those engineers are on job boards looking for their next move because a lot of them have their own lock-in, right?


Their own sunk cost in learning all this stuff around Amazon. They may want to work in that ecosystem, so you could lose a lot of your engineers by the choice you make, depending on where you go. And this also counts for people who are moving into the Cloud. If you're in data centers now and you're moving to the Cloud, the choice you make isn't always who will give you the best deal. It’s, how can I retain my staff as well, right? That's a big part of the lock-in that, again, people don't even think about. No one thinks about the people side, Jesse, I don't understand it.


Jesse: It drives me crazy.


Pete: [laughs]. So, we've talked a little bit about, you know, that lift-and-shift model, right: the enterprise folks that are running their data centers want to get into the Cloud, and the model you hear most often is lift and shift. Pick up your application as it exists and, kind of, drop it down there.


And those clients, in many ways, follow this model we talked about, right? They go and spin this up on EC2, and they go and deploy their application. And that's fine, actually. That's a smart move, and it's actually the move recommended by most cloud providers: just bring your service over, get it into the Cloud as soon as possible, then re-architect. But there's actually a great blog post out there from A Cloud Guru talking about the Lift and Shift Shot Clock. What was this concept that they talked about in that blog post, Jesse?


Jesse: The ‘lift and shift shot clock’ is something that I think every enterprise faces at some point. Every company that we've talked to that says they are either in a quote-unquote, “hybrid model,” or in the enterprise data center space moving into the Cloud, faces it. The lift and shift shot clock is the time you have after you lift and shift before, if you don't update your applications, you lose your engineers. Effectively, you are counting down from the moment of that lift and shift: once you have moved all of your infrastructure into the Cloud—whether it's AWS or another provider—and you don't then migrate to the native services that that cloud provider offers, the clock is running.


If you don't make that move, you're effectively keeping your feet in two different places. You are focusing on data centers, and you are focusing on your cloud provider. And it becomes harder and harder for existing and new engineers to know, where do I deploy something? Do I deploy it to the data center? Do I deploy it to the Cloud?


We ran into this with a previous client where multiple different product offerings existed across both their data centers and the Cloud. And the teams that were managing these offerings didn't know where to deploy their work because they didn't know: is everything supposed to be moving towards the Cloud? Or are we moving back to the data centers? Or are we splitting it between the two? There wasn't a clear business decision and roadmap saying, “Hey, this is the way that we need to move.”


It was really, really important for leadership to effectively point in one direction and continue to march forward. And not just leadership, but ultimately this needed to be a grassroots effort as well. It's really important for everybody in the company to be involved in this conversation to make sure that once a company decides to move from a data center into the Cloud, that they flesh out all parts of the migration and make that final step from lifting and shifting to cloud-native offerings.


Pete: Yeah, you can't just lift and shift over and 12, 18, 24 months later—


Jesse: Right.


Pete: —still have all those systems there. You need to start adopting all of the benefits of the Cloud: the ephemerality, and all the different PaaS services that are arguably providing you a service much better than you could build yourself. And once you get two, three, four years out, you're just going to have this drain of talent as people don't want to deal with the old busted thing that has just been carried over to the new environment. Really leverage those engineers to adopt those technologies. And there's a lot of benefit there.


Jesse: I think it's also really important to note that this takes work. No matter which direction you go, migrating into the Cloud, or moving things from, say, EC2 instances into managed service offerings, is going to take work; this is not something that is going to be extremely easy. But the reward is absolutely worth the effort. You're absolutely going to get benefits from migrating software that is running on virtual machines into an ephemeral service like AWS Fargate or Lambda functions. You are absolutely going to get benefit from this and see ROI. But it is going to take work.


Pete: Exactly. And the main reason that we mention this—that we recommend it—really all just comes down to having a single cloud provider. And we could probably fill another Whiteboard Confessional on why you should only have one cloud provider, but by choosing one single cloud provider, you remove a lot of the complexity that exists in trying to do multi-cloud, which, as we talked about earlier, no one really is doing. But the biggest part is that you actually have a much stronger position to negotiate for better discounts by having just one provider. By adopting, in Amazon's example, their PaaS services versus just EC2, you can negotiate for service-specific discounts that can actually make the cost of those PaaS services a lot more aggressive, and maybe the delta isn't as big as you were thinking.


And if you're thinking to yourself, “Well, I'm negotiating a new discount program with Amazon, and I'm just going to go to them and say, ‘Well, I'm thinking that I might move all my infrastructure to Google or to Azure,’” understand that, except in some very rare cases, unless you can actually move over your data and your workloads in, like, weeks—which most people really can't do; they don't have the capability of doing that—it's a really idle threat to say, “Oh, I’m going to move over to this other cloud provider.” It's just too much of a lift to actually accomplish. So, because that level of effort is so high, focus on trying to get the most out of what your cloud vendor is providing you, whether it's Amazon, Azure, Google, whoever. Try to adopt as many of their PaaS services as possible; they can help you move a lot faster, and you don't have to worry about scaling up Cassandra because you can just use a PaaS service. It's not all roses; there are definitely reasons why using those PaaS services could be a pain—maybe you're losing some visibility, or losing the ability to run the latest version—but from a pure cost perspective, and when you think about the overhead of the people, it is a lot less expensive. You don't have to send those engineers off to go run the databases themselves. And you can get a lot of other benefits from there as well.


Jesse: It's also worth noting that your cloud account team wants to have these conversations with you. If you show that you are invested in their platform, their provider, their service, they will absolutely invest in you as well. They will provide benefits, they will provide discounts, they will have engaging conversations with you to figure out the best ways that you can receive discounts based on the amount of traffic, or compute resources, or usage that you have on their platform. So, don't be afraid to reach out to your account team and start these conversations, especially if you are planning to move more resources into your cloud provider. They absolutely want to have this conversation with you and they are open to having this conversation with you.


Pete: Yeah, for those folks out there that paid big money for enterprise support: use it; it's there; you pay that money for a reason. Reach out to your account manager and your technical account teams. It does not matter which vendor you're with. All cloud vendors should have an account team, especially if you have a reasonable amount of spend.


And like Jesse said, talk to them. They want your business and they want to help you. They want you to feel like you're getting value for what you spend. But what we can say is definitely adopt all of the great things that the Cloud provides. If you treat it like just another data center, you're just going to end up with a lot of inherent waste in the system.


Well, Jesse, thanks again for joining me for this rant about why the Cloud is not just another data center.


Jesse: Thank you as always, for having me. I am always happy to rant about anything Cloud-related, especially in this context.


Pete: That's one thing that we always agree on: ranting about the Cloud is a lot of fun, especially when you spend so much time in it like we do. If you've enjoyed this podcast, please go to lastweekinaws.com/review and give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review, give it a five-star rating on your podcast platform of choice, and tell Corey congrats on the new addition to his family. Hopefully, he'll be back in, I don't know, a few more weeks from his paternity leave. But until then you are stuck with us. Thank you.


Announcer: This has been a HumblePod production. Stay humble.


Fri, 09 Oct 2020 03:00:00 -0700
Reader Mailbag: AWS Services (AMB Extras)

Links Mentioned

Sponsors

Never miss an episode



Help the show



What's Corey up to?

Wed, 07 Oct 2020 03:00:00 -0700
No Hateration or Holleration in this Dancery
AWS Morning Brief for the week of October 5th, 2020 featuring guest host Angela Andrews.
Mon, 05 Oct 2020 03:00:00 -0700
Turn on AWS Cost Anomaly Detection Right Now—It’s Free (Whiteboard Confessional)

About Corey Quinn

Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.


Transcript


Corey: This episode is sponsored in part by Catchpoint. Look, 80 percent of performance and availability issues don’t occur within your application code or your data center itself. They occur well outside those boundaries, so it’s difficult to understand what’s actually happening. What Catchpoint does is make it easier for enterprises to detect, identify, and of course, validate how reachable their application is, and of course, how happy their users are. It helps you get visibility into reachability, availability, performance, reliability, and of course, absorbency, because we’ll throw that one in, too. And it’s used by a bunch of interesting companies you may have heard of, like, you know, Google, Verizon, Oracle—but don’t hold that against them—and many more. To learn more, visit www.catchpoint.com, and tell them Corey sent you; wait for the wince.



Pete: Hello and welcome to the AWS Morning Brief: Whiteboard Confessional. Corey is still not back. Of course, he did just leave for paternity leave, so we will see him in a few weeks. So, you're stuck with me, Pete Cheslock, until then. But luckily, I am joined again by Jesse DeRose. Jesse, thanks again for joining me today.



Jesse: Thank you for having me. You know, I have to say I love recording from home. I can't see the look in our listeners’ eyes as they glaze over while we're talking. It's absolutely fantastic.



Pete: It's fantastic. It's like a conference talk, but there's no questions at the end. It's the best thing ever.



Jesse: Yeah, absolutely. I love it.



Pete: All right. Well, we had so much fun last week talking about a new service. Although it turns out it was only new to us. It was AWS Detective—or Amazon Detective; there's still some debate about what the actual official name of that service is. For some reason, I thought that service came out in the summertime, but it turns out it was earlier in the year. So, still a great service, AWS Detective—or Amazon Detective, whichever way you go with that one. But we had such a fun time talking about a new-to-us service that this week we took the opportunity to test out an actual brand-new service. This was a service that was just announced last Friday. And that's the AWS Cost Anomaly Detection service. Jesse, what is this service all about?



Jesse: So, you likely would notice if your AWS spend spiked suddenly, but only the really, really mature organizations would be able to tell immediately which service spiked. Like, if it's one of your top five AWS services by spend, you'd probably be able to know that it spiked; you'd probably be able to see that easily in either your billing statement or in Cost Explorer. But what if you're talking about a spike in a much smaller amount of spend that's still important to you, but on a service that you don't spend a ton of money on: a service that is not a large percentage of your bill. Let's say you use WorkSpaces, and you only spend $20 a month on it. You ultimately do want to know if that spend spikes 100 percent or 200 percent, but overall, that's only maybe an extra $20 on your bill. So, that's not something you'd see very easily unless it spikes exponentially.



So, the existing solutions for this problem require a lot of hands-on work to build a solution. You either need to know what your baseline spend is in the case of AWS Budgets, or you need to perform some kind of manual analysis via custom spreadsheets or business intelligence tools. But AWS Cost Anomaly Detection kind of gets rid of a lot of those things. It allows you to look at anomalous spend as a first-class citizen within AWS.



Pete: Yeah, the other trick too, with this anomalous spending—and I've gotten really good at learning how to spell ‘anomaly’ because I've spelled it very wrong my entire life, but just writing the preparatory material for this, the number of times I spelled anomaly has really solved that problem for me. Now, sometimes those mature organizations might see that anomalous spend maybe the day after, maybe the week after, but I've been a part of organizations that see that spend when the bill comes. That's actually pretty common. You're not an outlier if you only identify these outliers in spend when your bill arrives. And that outlier in spend could be something like, “Wow, we changed a script, and we're doing a bunch of list requests—where did that $8,000 come from?” or, “We're testing out Amazon Aurora and we did a lot of IOs last weekend, and our estimated bill is going to be $20,000.” Those are all things that, if you're not a crazy person who's so in love with your bill that you look at it every day, you're going to miss, right? You're just going to wait for the invoice. That's what happens to everyone, right, Jesse?



Jesse: Absolutely. Yeah, it has been really fascinating for us to see this pattern again and again, honestly, with some of the clients that we worked with, but also within the companies that I've worked with over the years. It's just not something that is highly thought about until finance sees the bill at the end of the month or after the end of the month, and then it becomes a retroactive conversation, or a retrospective to figure out what happened. And that's not the best way to think about this.



Pete: Yeah, exactly. I mean, the best way to save money on your bill—something we see every day—is to avoid the charge, right? Avoid those extra charges. And the way you can do that is to know of an anomaly in advance. So, one of the best parts of this feature—I can't believe we've made it nearly five minutes into this conversation without calling out the most impressive part of Anomaly Detection—is the fact that it's all ML-powered. Now, I know what you're thinking: you just cringed when I said ML. It's machine learning. And I cringe whenever a company markets based on machine learning. And the rule that I have is, you need to tell me how many PhDs are on your staff before I believe you can actually do machine learning.



Jesse: [laughs].



Pete: In the Amazon case, as it turns out, I could guess that they hire quite a few PhDs, so I feel like I'm going to give them a pass on this one.



Jesse: I feel like this is going to be a fun, over-under conversation of how many PhDs were on the team that put this service together, or built the machine learning component of AWS Cost Anomaly Detection.



Pete: I'll tell you what: I'd guess it's more than most SaaS services that market based on machine learning.



Jesse: Absolutely.



Pete: Now, we got an insight into this from the product manager in advance; we got to check it out, which was great. And then we learned that this is just there. It's in your account right now; you can go into your Amazon account and enable Cost Anomaly Detection right now, and the best feature is it costs nothing. There's no charge for this. Now, there's some alerting you can set up, so there are charges for SNS or other notifications, but this service will let you know of anomalies in your spend, and you don't have to pay for it.



So, the best advice I can give you right now is that you should, at the conclusion of this wonderful podcast, go and enable this service and see what you find out. Now, there are a bunch of caveats, and we'll talk about some of those. I think one of the things that we learned, which was pretty interesting, is that it will currently only let you know of spikes in spend—anomalies that are an increase in spend. And I'm sure that we could have a really long conversation with an actual PhD about why it's hard to identify dips in spend versus spikes in spend, but if you think about it, this is their initial release. And it's very clear it's an initial release because it has the beta tag right in the UI. That was pretty interesting. Jesse, how many services have you ever seen pop up with specifically a beta tag within Amazon?



Jesse: I have seen plenty of services that have the preview tag; I've never seen an AWS service that has the beta tag. I feel like this is a new evolution of AWS services that we are seeing for the very first time.



Pete: I think that's great. I mean, I have worked at companies where I've had fights with product people about whether we should put a beta tag on something or not. What does that mean? What are we communicating? But I think in this scenario, it's perfect, right?



This is a new service. It's beta; it's free of any charge, other than the notifications that you might have set up. They really want users of this, and when we talk a little bit more about this, I think you'll see why, and how you can review these anomalies and report back to Amazon how accurate those anomalies are. You can actually help improve the algorithm, which is pretty powerful if you think about it.



Jesse: I think that's one of the best features of this being in beta is that because it uses machine learning at its core, there is so much to be learned from many, many AWS customers enabling this now and training the model on their data and giving AWS the opportunity to continue to hone its model for each individual customer and for new customers.



Pete: Yeah, exactly. So, when you go into Cost Explorer—which is where you'll find this within the Amazon console—there'll be Anomaly Detection; you can find it, and it has the helpful beta tag to call it out. When you hop into it the first time, it actually has a self-guided tour: a pop-up the first time you're going in there that walks you through—it’s not an entirely complex application, but it will still walk you through—what each section means, how you set this up, and how you start using it. And kudos. Seriously, non-sarcastic kudos to the Amazon product teams and engineering teams for building that in. We should see more of those types of tours.



I know that Jesse and I both had issues with the AWS Detective service in that it just dumps you in and you're like, “I have nowhere to go.” This wasn't long—maybe a four-step tour—but it definitely explained where you are and what you need to do for this to provide value to you. So, kudos for that; hopefully, we see that in a lot more services. As you go through, it explains the Overview section, the Anomalies section, and the Alerting section. It's really basic, and those are the three main areas. And it claims that it will automatically alert you to changes in your overall spend: it will create these anomalies for any type of anomaly that it thinks exists. But then there are the alerts—these custom monitors you create. And there's a whole slew of custom monitors. Jesse, what did you find as we dove into the custom monitors that you can create for anomalies?



Jesse: This was definitely one of the most interesting features for me. It's something that I have definitely seen with a lot of customers, but had not really actively thought about until I saw it here. So, again, kudos to the team who built this product for creating multiple different custom monitors. The easiest way to get started with Anomaly Detection is enabling AWS Services monitoring. So, effectively, you are looking at each AWS service individually and asking, “Has the spend for this particular service gone up or down? Has my EC2 spend gone up or down? Has my S3 spend gone up or down?”



Or in this case, has it only gone up, given the current state of affairs. But there are other opportunities as well, which I'm really excited about. We can create a custom monitor that looks at a specific set of linked accounts. So, if you have separated your AWS accounts into multiple different accounts based on business units, different application environments, or other criteria, you can create monitors that allow you to specifically alert on anomalous spend in production, anomalous spend in development, or anomalous spend for one particular business unit or team, depending on how you've sliced and diced your AWS account structure. So, there are lots of opportunities here to specifically focus on the things that you care about, and spend less time worrying about the components that you don't care about.



You can also use AWS Cost Categories or cost allocation tags as your monitor, to slice and dice anomalous spend based on specific tags that you have created. So, if you've created tags for different products or different teams, and maybe you're not at the point where you're ready to break AWS spend out into different linked accounts but you still want to alert on anomalous spend for different teams or business units, you have that opportunity here with AWS Cost Categories and cost allocation tags. So, right out of the gate, not only is AWS telling you that they will look at your overall bill and alert you to anomalous spikes, but they're giving you multiple different vectors to alert on anomalous spend, which is really, really fantastic.
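
For reference, here's a minimal sketch of what creating the monitor types discussed above looks like programmatically, using boto3's Cost Explorer client. The monitor names and account IDs are hypothetical placeholders, and the console can do all of this for you as well.

import boto3

ce = boto3.client("ce", region_name="us-east-1")

# The per-service monitor: one dimensional monitor that evaluates
# each AWS service's spend independently.
services = ce.create_anomaly_monitor(
    AnomalyMonitor={
        "MonitorName": "all-aws-services",
        "MonitorType": "DIMENSIONAL",
        "MonitorDimension": "SERVICE",
    }
)

# A custom monitor scoped to a specific set of linked accounts,
# for example your production accounts.
production = ce.create_anomaly_monitor(
    AnomalyMonitor={
        "MonitorName": "production-accounts",
        "MonitorType": "CUSTOM",
        "MonitorSpecification": {
            "Dimensions": {
                "Key": "LINKED_ACCOUNT",
                "Values": ["111111111111", "222222222222"],
            }
        },
    }
)

print(services["MonitorArn"], production["MonitorArn"])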



Corey: This episode is sponsored in part by ChaosSearch. Now their name isn’t in all caps, so they’re definitely worth talking to. What is ChaosSearch? A scalable log analysis service that lets you add new workloads in minutes, not days or weeks. Click. Boom. Done. ChaosSearch is for you if you’re trying to get a handle on processing multiple terabytes, or more, of log and event data per day, at a disruptive price. One more thing, for those of you that have been down this path of disappointment before, ChaosSearch is a fully managed solution that isn’t playing marketing games when they say “fully managed.” The data lives within your S3 buckets, and that’s really all you have to care about. No managing of servers, but also no data movement. Check them out at chaossearch.io and tell them Corey sent you. Watch for the wince when you say my name. That’s chaossearch.io.



Pete: As we went through the setup process, selecting what we wanted to monitor, at least for our accounts, was pretty easy. We just said AWS services. But if we had a lot of accounts, or if we wanted to break it out like Jesse said, we could have selected one of the other options. But one of the first custom options that you have to fill out is something called an alert threshold. And the way that Amazon explains it, a threshold is not the same as an anomaly.



So, anomalies are things that are detected via machine learning, and those happen completely separately from the monitors that you create; the monitors are the things that are going to notify you. And it even says, for example, you could set a zero dollar threshold alert of every anomaly, even if the cost impact is $1. And even though that sentence is written extremely poorly in the documentation, what I think they're trying to get at is that if you wanted to get alerted for every anomaly that Amazon identifies via their ML, you put in a threshold of zero. So, this kind of raises the question of, “Well, what dollar amount do I put here?”



And that is a big question, and it's going to be different for every organization. Maybe you want to look at your total spend. Say you spend $100 a month: do you want to be notified of a 10 percent swing? Put $10 in there. Do you want to be notified of a 1 percent swing? Put $1 in there. So, I think it's best to think of this as: what's the percentage spike that you would raise an eyebrow at if you were looking through Cost Explorer on your own? That's the threshold you put there.



Now, granted, this is again for alerting you, and there are a lot of different ways to alert. You can use SNS to alert via countless options: send it to a Lambda, send it to a chatbot. But you can also summarize these alerts. If you want every individual alert, sending it through SNS is actually your only option; if you want daily or weekly summaries, you can specify that right in the monitor alert and put in the email addresses you want, and those will just get created for you. So, the only hard part of setting this up is really identifying the anomaly threshold you want. And my guess is, as we see some anomalies, that's a threshold we can adjust later if we're getting alerted too often or not enough.
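
As a sketch of the split Pete describes, under the assumption that the Cost Explorer API's anomaly subscription calls behave as documented: individual alerts want an SNS subscriber with IMMEDIATE frequency, while summaries want EMAIL subscribers on a DAILY or WEEKLY schedule. The ARNs and addresses below are hypothetical.

import boto3

ce = boto3.client("ce", region_name="us-east-1")
monitor_arn = "<your-anomaly-monitor-arn>"  # hypothetical placeholder

# Individual, real-time alerts: SNS is the only supported subscriber type.
ce.create_anomaly_subscription(
    AnomalySubscription={
        "SubscriptionName": "realtime-anomaly-alerts",
        "MonitorArnList": [monitor_arn],
        "Threshold": 10.0,  # only alert when the cost impact is at least $10
        "Frequency": "IMMEDIATE",
        "Subscribers": [{
            "Type": "SNS",
            "Address": "arn:aws:sns:us-east-1:111111111111:cost-anomalies",
        }],
    }
)

# Weekly summaries go to email addresses instead.
ce.create_anomaly_subscription(
    AnomalySubscription={
        "SubscriptionName": "weekly-anomaly-summary",
        "MonitorArnList": [monitor_arn],
        "Threshold": 0.0,  # a zero threshold includes every anomaly
        "Frequency": "WEEKLY",
        "Subscribers": [{"Type": "EMAIL", "Address": "finance@example.com"}],
    }
)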



Jesse: I totally agree. And I think one of the big things to highlight there is that this may be something you can set up now at zero dollars to alert on everything, and fine-tune later. Let the alert be noisy until you find the right threshold; I would rather have too many alerts than too few. It may also end up being something where you have a business conversation with engineering leadership and finance to understand what threshold they want to be notified at. Do we create individual alerts for individual teams, while engineering or finance leadership only cares about a percentage-point increase in spend, or maybe a certain dollar amount? Those are important items to discuss as you're enabling these alerts. But ultimately, you can start with a very noisy alert and then change it later.



Pete: Yeah, always remember, too, Jesse and I are former operators. We fully understand alert fatigue. And at this point, we're just not sure how many anomalies we're going to see, so we actually configured ours for an SNS notification; it goes to the AWS Chatbot service, and that dumps into our Slack channel. So, this is actually part one of a two-part Whiteboard Confessional. We're going to let this run for a few weeks and see what we get back.



Additionally, we're going to reach out to the Amazon product owners of this application, because it wasn't all roses. Getting this set up, we actually did run into issues, which, honestly, I was a little surprised by, only because there really aren't a lot of configuration options. What was the first issue we ran into?



Jesse: The first issue that we ran into was actually one of the most interesting, and it took the most time to troubleshoot, because the alert message that we received when we tried to create our first monitor very generically said, “This monitor cannot be created.” It's that beautiful mix of telling me something that I already clearly know while not telling me how to fix it. It's not giving me the information that tells me what the next step is to fix this. And we spent, I want to say, 10, maybe 15 minutes poking around different settings with our SNS topic and different settings with permissions. We also found out that, ultimately, there are certain permissions that you need to enable on your SNS topic ahead of time; otherwise, Anomaly Detection won't let you write to that topic. So, there are little things like that that we ran into.
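
For anyone hitting the same wall: a minimal sketch of the topic policy we believe Anomaly Detection needs, applied with boto3. The costalerts.amazonaws.com service principal is our reading of the documentation at the time; verify it against the current docs before relying on it, and note that set_topic_attributes replaces the whole policy rather than merging into it.

import boto3
import json

sns = boto3.client("sns", region_name="us-east-1")
topic_arn = "arn:aws:sns:us-east-1:111111111111:cost-anomalies"  # hypothetical

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "AllowCostAnomalyDetectionPublish",
        "Effect": "Allow",
        # Assumed service principal for Cost Anomaly Detection.
        "Principal": {"Service": "costalerts.amazonaws.com"},
        "Action": "SNS:Publish",
        "Resource": topic_arn,
    }],
}

sns.set_topic_attributes(
    TopicArn=topic_arn,
    AttributeName="Policy",
    AttributeValue=json.dumps(policy),
)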



But ultimately, what we found was that there was already a monitor in our account for AWS Services, and we were trying to create a monitor for AWS Services. And this service does not let you create two monitors with the same custom monitor type right now. So, off the top of my head, I thought to myself, “Sure. I get that. It's trying to de-duplicate as much as possible.”



But one thing to think about is that there are different use cases for the same type of monitor. So, for example, you might want the team that is focusing on this cost optimization work or this anomalous spend work to receive individual alerts via an SNS topic that maybe goes to a Slack channel. But maybe engineering leadership or finance wants to know, at a high level, what those alerts are on a daily or weekly basis. So ultimately, there are different alerts that may come out of the same monitor, and there are definitely opportunities here to improve the service over time to allow some of these things to be fleshed out. This is a service in beta, so we weren't expecting it to be perfect, and we expected there to be slight rough edges. But at the same time, this wasn't a huge loss in any way. The service is still absolutely usable, and we highly, highly recommend it. I highly, highly recommend it.



Pete: Yeah, it's also important to note that neither Jesse nor I spent any time reading the documentation in advance because we wanted to represent the average Amazon user, which is, “I’m going to go in and I'm going to start clicking around,” because that's what we do. We go in and we just try to figure it out. So, we didn't actually read the documentation, and I don't know if maybe the documentation has various caveats in there. So, I'm going to apologize in advance if it turns out that, “Oh, yeah. This is well documented that it cannot do that.”



It's the disconnect, of course, that the documents live in a completely different area; also, the error message could have been a little more helpful and simply said, “You cannot do this,” right? That would have been a lot easier, because, as Jesse said, once I was debugging IAM permissions for the SNS topic, I kind of thought, “Wow, what has happened here?” I thought we were just going to click three or four times and make this magic happen. So, that being said, it could be a documented caveat; we didn't read the documentation. But it's definitely something to keep in mind, and, honestly, I hope it's a capability they build and bring to the surface in the future, because I would like that ability. I would like an email to go to one team versus another team, maybe real-time alerts going into another location. There are just different ways that I might want to be notified. And of course, I could probably code up something with Lambda, but honestly, that just feels like a bit of a cop-out. So, Jesse, what were your final thoughts about this product? I personally thought this is a really impressive initial release of a really interesting product, and it's free. But what were your thoughts?



Jesse: Absolutely. This is a product that may still be in beta, but it already has a lot of polish on it, and there's a lot of really great value-add; since it's free, there's no reason not to use it right now. Even if you set your alerts at a high threshold—so maybe you're not getting email notifications regularly—I highly recommend enabling this service just to see what kind of information it shows you in terms of anomalous spend. And one thing that we didn't talk about in much detail, but I think is really important to note: because it is a machine learning model, you have the opportunity to train the model. When you receive alerts, you will receive a notification that asks: was this alert actually useful? Was this spend actually anomalous? And you can train the model by answering, “Yes, the spend was anomalous,” or, “No, the spend was not anomalous,” and help the model get better at understanding your spend, and any new customer's spend. So, at the end of the day, for a free feature with minimal configuration, I highly recommend it.
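
That feedback loop is also exposed through the API. A minimal sketch, assuming the Cost Explorer client's anomaly calls; in practice you'd decide per anomaly rather than blanket-labeling everything, and the dates here are placeholders.

import boto3

ce = boto3.client("ce", region_name="us-east-1")

# Pull the anomalies detected over a given window...
anomalies = ce.get_anomalies(
    DateInterval={"StartDate": "2020-09-01", "EndDate": "2020-09-30"}
)

# ...and tell the model whether each one was real. Valid feedback
# values are "YES", "NO", and "PLANNED_ACTIVITY".
for anomaly in anomalies["Anomalies"]:
    ce.provide_anomaly_feedback(
        AnomalyId=anomaly["AnomalyId"],
        Feedback="PLANNED_ACTIVITY",
    )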



Pete: Absolutely. It's free. It's free. Just go turn it on. Well, Jesse, thanks again for joining me and helping me deep dive into yet another Amazon product. Join us in a few weeks; our hope is that we'll have some anomalies by then, and that we'll see some alerts get sent to us. We're also going to reach out to the account managers—or the product owners—of this service and get some clarification, and maybe learn a little more: what did we miss that we really should have been talking about? And we'll share that as well.



Thanks again, Jesse. Again, really appreciate it. If you enjoyed this podcast, please go to lastweekinaws.com/review and give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review, give it a five-star rating on your podcast platform of choice and tell Corey that you still miss his hot takes. This is Whiteboard Confessional. Thanks again.


Announcer: This has been a HumblePod production. Stay humble.

Fri, 02 Oct 2020 03:00:00 -0700
Paternity Leave (AMB Extras)

Links Mentioned

Sponsors



Never miss an episode



Help the show



What's Corey up to?

Wed, 30 Sep 2020 03:00:00 -0700
Cost Anam--Anom--screw it, Cost Outlier Detection
AWS Morning Brief for the week of September 27th, 2020.
Mon, 28 Sep 2020 03:00:00 -0700
Inspecting Amazon Detective (Whiteboard Confessional)

Links


Transcript

Corey: This episode is sponsored in part by Catchpoint. Look, 80 percent of performance and availability issues don’t occur within your application code in your data center itself. It occurs well outside those boundaries, so it’s difficult to understand what’s actually happening. What Catchpoint does is makes it easier for enterprises to detect, identify, and of course, validate how reachable their application is, and of course, how happy their users are. It helps you get visibility into reachability, availability, performance, reliability, and of course, absorbency, because we’ll throw that one in, too. And it’s used by a bunch of interesting companies you may have heard of, like, you know, Google, Verizon, Oracle—but don’t hold that against them—and many more. To learn more, visit www.catchpoint.com, and tell them Corey sent you; wait for the wince.



Pete: Hello, and welcome to the AWS Morning Brief: Whiteboard Confessional. You are not confused. This is definitely not Corey Quinn. This is Pete Cheslock. I was the recurring guest. I've pushed Corey away, and just taken over his entire podcast. But don't worry, he'll be back soon enough. Until then, I'm joined by a very special guest, Jesse DeRose. Jesse, want to say hi?



Jesse: Howdy everybody.



Pete: Jesse and I are two of the cloud economists that work with Corey here at The Duckbill Group, and I convinced Jesse to come and join me today to talk about a new Amazon service that we had the pleasure—mm, you be the judge of that—of testing out recently, a service called Amazon Detective. This is a new service that I want to say was announced a couple of weeks ago, actually longer than that because, as you'll learn, it took us a little while to actually get a fully up and running version of this going, so we could actually do a full test on it. But as you can imagine, we get a chance to try out a lot of new Amazon services. And when we saw this service come out, we were pretty excited. Jesse, maybe you can chat a little bit about what piqued your interest when we first heard of Amazon Detective.



Jesse: So, we here do a lot of analysis work with VPC Flow Logs. There's so much interesting data to be discovered in your VPC Flow Logs, and I really enjoy getting information out of those logs. But ultimately, digging into those logs via AWS’s existing services can be a bit frustrating; it can be time-consuming to go through the administrative overhead of analyzing them. So, for me, I was really excited to see how AWS Detective would let us dig into some of that data more fluidly, more organically, and get at the same information with, ideally, less hassle.



Pete: Exactly. So, for those that have not heard of AWS Detective yet, I'm just going to read off a little bit of what we read in the Amazon documentation that actually got us so excited. They talked a lot about these different security services, like Amazon GuardDuty, Macie, Security Hub, and all these partner products, but finding a central source for all of this data was challenging.



And one of the things they actually called out, which got us really excited, is these few sentences. They said, “Amazon Detective can analyze trillions of events from multiple data sources such as Virtual Private Cloud (VPC) Flow Logs, AWS CloudTrail, and Amazon GuardDuty, and automatically creates a unified, interactive view of your resources, users, and the interactions between them over time.” It was actually this that got us really excited because, as Jesse mentioned, we spend a lot of time trying to understand our clients’ data transfer usage. What is talking to what? Why is there a charge for data transfer between certain services? Why is it so high? Why is it growing? And we spend, unfortunately, a lot of time digging around in the VPC Flow Logs. So, when we saw this, we got really excited because—well, Jesse, how do we do this today? How do we actually glean insight from Flow Logs?


Jesse: It's a frustrating process. I feel like there has got to be a better way for us to get this information, and for a lot of our clients, every single time we have to ask them to send over or share these VPC Flow Logs, there's that little wince of the implied, “I'm so sorry that we have to ask you to do it this way.” Because it's doable, but it requires syncing data between S3 buckets, creating and running Athena queries; there are lots of little pieces required to build up to the actual analysis itself. There are no first-class citizens when it comes to analyzing these logs.



Pete: It's really true. And Athena, the Data Factory—the Data Glue—what is it? Glue. You have to create a Glue Catalog. It's just a lot of work when we're really just trying to understand who and what are the top producers and consumers of data that are likely impacting spend for a client.
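
For a sense of what that manual workflow looks like once the Glue table exists: a minimal sketch of a top-talkers query run through Athena with boto3. The table name, database, and results bucket are hypothetical, and the columns assume the default flow log format.

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Who is sending the most bytes to whom, according to the flow logs.
query = """
SELECT srcaddr, dstaddr, SUM(bytes) AS total_bytes
FROM vpc_flow_logs
GROUP BY srcaddr, dstaddr
ORDER BY total_bytes DESC
LIMIT 20
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)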



So, we saw this and we thought to ourselves, “Wow.” That one sentence in the list, the one about “the interactions between all of these resources and users over time,” got us really excited. We also got excited because, of course, we love understanding how much things cost, and the pricing for Detective didn't seem that crazy. I mean, it's not great, but it's all based on ingested logs, which they don't really describe. So, our assumption is that if you send it your VPC Flow Logs, or CloudTrail logs, or whatever, you're going to pay for those on top of probably already paying for them today. So, that could be a deal-breaker for some clients out there.



Jesse: That's the thing that was super frustrating, or super interesting, for me: AWS Detective, in terms of pricing, technology, and capability, doesn't replace any of these other components. It is additive, which, generally speaking, I think is great. But when you start looking at it from a price perspective, that means you're going to pay for CloudTrail logs, and VPC Flow Logs, and GuardDuty, and Macie, and all of these other services, and now you're going to pay for AWS Detective on top of that. So, it feels like you're paying twice for a lot of these services, when you could do a lot of the same analysis work yourself. It's probably not going to be as clean to do it yourself, in terms of building out the Glue Catalogs, Athena tables, and queries we talked about, but ultimately it may be less expensive, because you're not paying for all of these additive services stacked on top of each other.



Pete: Exactly. I think we're definitely not being fair to the Amazon Detective product teams because we're trying to use this service, or we're hoping this service solves a really specific painful use case for us. And really, it's just based on what we found in their public-facing marketing.



So, how does this actually work? Well, we found some really great information online via Amazon. They did a great job documenting how this all works. Essentially, you enable Amazon Detective, and you enable CloudTrail, VPC Flow Logs, and GuardDuty; you have to enable them in multiple accounts, and Jesse can talk a little bit more about some of the caveats we ran into just setting it up within our own accounts. What it does, though, is distill that data down.



So, it consumes all of these different data sources, and it then gives you this—ugh, it sounds terrible to say it—single pane of glass for these different log types. So, if you have, for example, an IAM user that is associated with a large amount of network data transfer, could that be a data exfiltration attempt or something like that? Essentially, what they're trying to solve here is something like a SIEM/SIM for Amazon-created logs. That's really what it felt like to me after we had gone through this. What did you think, Jesse?



Jesse: I agree. I definitely felt like this is Amazon building their own SIM solution within AWS, to effectively make all of these logs and alerts first-class citizens so that you don't have to send all of this data from your CloudTrail logs, your GuardDuty findings, and your VPC Flow Logs into a third-party solution. You can send all of it directly to Amazon Detective, which ultimately lets you click through a lot of the findings via deep links. So, if you look at that single pane of glass—it hurts me to say it, too—you can click through a finding to the GuardDuty page where GuardDuty is looking at it, or to the CloudTrail logs page where CloudTrail can dig in deeper. There's a lot of opportunity for this deep linkage to let you dig into the data in a way you would not get from a third-party solution; with a third-party solution there would be a lot of back and forth between tabs and accounts. It's a lot easier, a lot smoother, with Amazon Detective to click through a finding, find the information you need, and find the potential remedy to the problem immediately.



Pete: Yeah, from our testing, it does a really good job of pulling these disparate data sources together and giving security engineers the ability to act on them. And where I think this could be a huge benefit is that there are a lot of companies that just don't have dedicated security teams. They still need to make it through SOC 2 or PCI audits, or HIPAA compliance; they need to show auditors that they're analyzing these security threats and that they have this type of technology, and this could be a really easy way to get up and running. So, we took it on ourselves to go and turn on Amazon Detective because, again, we wanted this to solve our VPC Flow Log discovery issues.



And while diving into it, as you can imagine with a new Amazon service—or well, most Amazon services—there are some caveats. There are some rough edges that you have to be careful about. One of the things that we found, and why we turned it on in the first place, was that you get a 30-day free trial. So, you can go and turn this on for your accounts; it is absolutely free for 30 days. But there was a very interesting caveat around this 30-day free trial when we turned it on. Jesse, what was this wonderful caveat?



Jesse: You get a 30-day free trial, but when you first turn Amazon Detective on in any account, it takes a minimum of 14 days to baseline. What they refer to as baselining is effectively ingesting all of the data from this particular account—or multiple accounts, if you are using AWS Organizations and pulling in data sources from multiple accounts, which we'll get to in a minute—bringing all that data together in one single pane of glass, and running some machine learning or AI analysis on top of it. But it takes two weeks to set up. You have to wait a minimum of 14 days in order to get any data.



Corey: This episode is sponsored in part by ChaosSearch. Now their name isn’t in all caps, so they’re definitely worth talking to. What is ChaosSearch? A scalable log analysis service that lets you add new workloads in minutes, not days or weeks. Click. Boom. Done. ChaosSearch is for you if you’re trying to get a handle on processing multiple terabytes, or more, of log and event data per day, at a disruptive price. One more thing, for those of you that have been down this path of disappointment before, ChaosSearch is a fully managed solution that isn’t playing marketing games when they say “fully managed.” The data lives within your S3 buckets, and that’s really all you have to care about. No managing of servers, but also no data movement. Check them out at chaossearch.io and tell them Corey sent you. Watch for the wince when you say my name. That’s chaossearch.io.



Pete: That sounds like nearly half of my trial period that I'm just waiting. And it is; you would be correct that you would be waiting for about half the trial period. Now, again, is that a deal-breaker for a lot of people? Probably not. As we found, the—what is it—remaining 16 days of our trial was more than enough time to get a feel for what AWS Detective can do. But still, it felt a little… it felt a little Amazon of them.



Jesse: Absolutely. It was this great moment of, “Okay, we're here, we're ready, we're going to kick the tires.” We turned everything on, we invited other accounts, and then immediately it says, “Fantastic. Go take a coffee break, go back to your daily life for 14 days, and then come back, and then we might have some information for you.”



Pete: What Amazon doesn't realize is that in these current times, I don't actually have a life to go to. So, I just sat there hitting refresh for the next 14 days. It was a long, long wait.



Jesse: I can vouch; I did the same thing. It was part of my morning routine. Just, I have always enjoyed watching paint dry, and this was equally enjoyable and equally fun to just wait on that one main dashboard screen, not knowing if data was being adjusted, not knowing if there was any progress being made, just seeing the one single status banner that read, “Your data is currently baselining. Please wait.”



Pete: So, that wasn't the only thing we found. One of the other really interesting caveats—edge cases—is that for this to work at all, you need to actually enable additional data sources. I mean, I don't know who you are if you don't have CloudTrail enabled; you should have CloudTrail enabled. But if you don't, you'll have to go and do that, and make sure it's enabled for any accounts you want to integrate with the service. You need to turn on GuardDuty; you may not be using GuardDuty, but you need to go enable it. And additionally, you need to enable VPC Flow Logs for whatever VPC you want to include in this.
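
For the impatient, a minimal sketch of enabling those prerequisites in a single account and region with boto3; the resource names and IDs are hypothetical, and the S3 buckets need their own policies that this sketch glosses over.

import boto3

region = "us-east-1"

# Enable GuardDuty in this account and region.
boto3.client("guardduty", region_name=region).create_detector(Enable=True)

# Turn on VPC Flow Logs for one VPC, delivered to S3.
boto3.client("ec2", region_name=region).create_flow_logs(
    ResourceIds=["vpc-0123456789abcdef0"],  # hypothetical VPC ID
    ResourceType="VPC",
    TrafficType="ALL",
    LogDestinationType="s3",
    LogDestination="arn:aws:s3:::my-flow-log-bucket",
)

# Create and start a CloudTrail trail, if you somehow don't have one.
cloudtrail = boto3.client("cloudtrail", region_name=region)
cloudtrail.create_trail(Name="main-trail", S3BucketName="my-cloudtrail-bucket")
cloudtrail.start_logging(Name="main-trail")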



These, in some cases, can cause additional charges to your account. One thing that I will say—because I do want to say something nice about Amazon—is that with this new service, they make it really clear during the trial period how much this is going to cost you. There is a section in Amazon Detective that will tell you, per account—whether it's 5 or 500—how much data it has ingested from those accounts, and what your estimated bill is going to be, so that when the trial is over, you actually are informed; you get real information on which to base your decision. And honestly, I love that feature. I think all new Amazon features should include that: letting you try it first, but also just saying, “Hey, this is what it's going to cost you.” And then it's up to you to say, “Yeah, okay. This is worth it for me.” I think that's something that was great, and so kudos to the Amazon product team for including it.



Jesse: One thing that we always discuss and highlight with our clients is the importance of thinking about cost in every aspect of cloud cost optimization and management. So, if you are able to predict how much money you are going to invest in a new architecture feature, or, in this case, in enabling Amazon Detective, it really helps you understand: how much more money am I spending on this service? Is that ultimately worthwhile for the business? Then you can make an informed business decision based on that information, rather than going in blindly saying, “We need a SIM solution, or we need some kind of additional security,” and then suddenly getting the bill later and balking at it.



Pete: Exactly. So, we kicked the tires, we did spend the last of our trial period diving into the dashboard. We added some real data from some of our internal Duckbill accounts so we could see things going on. And that was great. It would, obviously, be probably a lot more useful if we had a lot more volume going on.



But, you know what happened as we dove into this one? Well, when you go in blind to a new service, especially a new Amazon service, there aren't always a lot of great prompts to help guide you along. And this was no exception. When we landed in the console after the baselining period was clear, we essentially just landed on a blank search page with a search bar and a couple of hints. Since there are no suggestions of things to search for, you have to at least start by picking the type of thing you want to search on, whether it's an IP address or an account number, or something like that; only then will it say, “Hey, here are some recommendations of things that you might want to investigate.”



So, you really would use this because you know what you're looking for, or potentially because you got an alert from another service. And I think that's where it's supposed to tie these together: it's the place you go after you got the alert from GuardDuty, or after you saw something strange. It didn't really feel like a finished solution where this is the place you come to start; it's almost like this is where you go when something's happened. That was kind of my feeling on it. What were some of your thoughts, Jesse?



Jesse: I agree. I think that there is a ton of power in this service, but it's not intuitive. And that may be partially us diving in without having more data, or partially us diving in without poking around the other services that flow into it, like GuardDuty, and Macie, and VPC Flow Logs. This service has so much potential, and there is so much opportunity here, but it is very, very overwhelming to load the main dashboard and see just a single search pane. Like you said, I felt like I needed to know what I was looking for going in, immediately.



This is not something where I had easy browse capability. But again, to AWS’s credit, once we did start poking around, there is tons of amazing information. It's deep knowledge. As we mentioned before, there are deep links to other services, and lots of really thorough, intricate details for things like VPC Flow Logs and findings, which allowed us to get a really clear picture of what was going on in our AWS accounts. So, I was really impressed with the amount of detail in all of these findings. But again, I would not have known that that amount of information was available to me simply from the main search screen.



Pete: Yeah, only when you start searching on different things, maybe IAM users or IP addresses, does the full power of this application really become apparent, where you can see different applications by accessing IPs, by ports used, by bandwidth used, even API calls by IAM user, which I thought was super interesting as well. All of those little things are buried within a search interface that, if you're a security engineer, may be exactly the solution you were looking for, because you have the questions; you just haven't had the interface to search on them. So, I think what's great is that this is just the first version of this; this is just what was released. I'm actually very excited to see where it goes from here, because one thing that we do know about Amazon is that they actually do listen to their customers. And if you are a user of this service, you should definitely give it a try. If you're not, it's a 30-day free trial. You really don't have anything to lose.



And before you decide to continue on with it and pay for it, they'll estimate about what it will cost you per month, and then you can make that decision. So, high level: Amazon Detective is a pretty cool service that seems to be a bit of a missing link between a lot of these other services that generate all of this data. It's probably been in planning for a long time, because a lot of these other services didn't have any sort of centralized way of reporting. So, I think that's really cool. It's a really interesting service, and it's something we're going to keep an eye on at Duckbill Group, and you should definitely check it out as well. Well, Jesse, thank you so much for joining me today and being a proxy for Corey. The two of us don't seem to quite equal a Corey, but we'll keep working on it.



Jesse: Thank you so much for inviting me to this. I look forward to future sessions as we kick the tires on other services.



Pete: Fantastic. Well, if you've enjoyed this podcast, please go to lastweekinaws.com/review and give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review, give it a five-star rating on your podcast platform of choice and tell me how much you miss Corey. Jesse, thank you again. Thank you very much. This is AWS Morning Brief: Whiteboard Confessional.



Announcer: This has been a HumblePod production. Stay humble.

Fri, 25 Sep 2020 03:00:00 -0700
Reader Mailbag: Billing (AMB Extras)


Links Mentioned



Sponsors



Never miss an episode



Help the show



What's Corey up to?

Wed, 23 Sep 2020 03:00:00 -0700
EC2 Gets t4gging Support
AWS Morning Brief for the week of September 21, 2020.
Mon, 21 Sep 2020 03:00:00 -0700
Chef Gets Gobbled Up (Whiteboard Confessional)


Transcript



Corey: This episode is sponsored in part by Catchpoint. Look, 80 percent of performance and availability issues don’t occur within your application code in your data center itself. It occurs well outside those boundaries, so it’s difficult to understand what’s actually happening. What Catchpoint does is makes it easier for enterprises to detect, identify, and of course, validate how reachable their application is, and of course, how happy their users are. It helps you get visibility into reachability, availability, performance, reliability, and of course, absorbency, because we’ll throw that one in, too. And it’s used by a bunch of interesting companies you may have heard of, like, you know, Google, Verizon, Oracle—but don’t hold that against them—and many more. To learn more, visit www.catchpoint.com, and tell them Corey sent you; wait for the wince.



Corey: Welcome to the AWS Morning Brief: Whiteboard Confessional, now with recurring perpetual guest Pete Cheslock. Pete, how are you?



Pete: I'm back again.



Corey: So, today I want to talk about something that really struck an awful lot of nerves across, well, the greater internet. You know, the mountains of thought leadership, otherwise known as Twitter. Specifically, Chef has gotten itself acquired.



Pete: Yeah, I saw some, I guess you would call them, sub-tweets from some Chef employees before it was announced, which is kind of common, where responses ranged from, “Oh, that's something new,” to, “Welp.” And I thought, “Wow, that's interesting.” Of course, then I started looking for news of what happened, which we all found out not long after.



Corey: Before we go into it, let's set the stage here because it turns out not everyone went through the battles of configuration management circa 2012 to 2015 or so—at least in my experience. What did Chef do? What was the product that Chef offered? Who the heck are they?



Pete: So, Chef, they were kind of a fast follower in the configuration management space to another very popular tool that I'm sure people have used out there called Puppet. Actually, interestingly enough, the founders of Chef ran a consulting company that was doing Puppet consulting; they were helping companies use Puppet. And both of those tools really came from yet another tool called CFEngine, which in many ways—depending on who you ask—is kind of considered the original configuration management tool, the one that probably had the earliest, largest usage. But it was very difficult to use; CFEngine had a really high barrier to entry. Tools like Puppet and Chef, which came out around the, let's say, 2007, '08, '09, '10 timeframe, were written in Ruby, which was an easier programming language to get up and running with. And they solved a problem for a lot of companies that needed to configure and manage lots of servers easily.



Corey: And there are basically four companies in here that really nailed it for this era; you had Puppet, Chef, Salt, and Ansible. And in the interest of full disclosure, I was a very early developer behind SaltStack, and I was a traveling contract trainer for Puppet for a while. I never got deep into Chef myself for a variety of reasons. First and foremost was that its configuration language was fundamentally Ruby, and my approach back then—because I wasn't anything approaching a developer—was that if I need to learn a full-featured programming language at some point, well, why wouldn't I just pivot to becoming, instead, a developer in that language and not have to worry about infrastructure? Instead, go and build applications and then work nine to five and not get woken up in the middle of the night when something broke. That may have been the wrong direction, but that was where I sat at the time.



Pete: Yeah, I came at it from a different world. So, I had worked for a startup that no one has probably ever heard of, unless you have met me before and know who I am, a company called Sonian, which was very early in the cloud space. It was email archiving, so it wasn't anything particularly mind-blowing because it was compliant email archiving. But what was interesting is that we were really early in the cloud space, and a lot of the tools that people use today just didn't exist for managing cloud servers. It was 2008, 2009, the pretty early EC2 timeframe. How would you provision your EC2 instances back then? Maybe you used CFEngine; maybe you used Puppet.



And actually, interestingly enough, that company—Sonian—was originally a Puppet shop because Chef didn't exist yet. And there was a series of issues we ran into: technical capabilities that Puppet just couldn't deliver for us at the time. That time being 2009, 2010, a lot of the very early Chef team, the founding team and early engineers, were working with us very closely to bootstrap our business on Chef, writing a lot of those original cookbooks that became community cookbooks. And so, my intro into Chef and the Chef community came a lot earlier than most, and I went a lot deeper with it just by nature of being so early into that space.



Corey: One of the things that struck me, despite not being a Chef aficionado myself, was, first, just how many people in the DevOps sphere were deeply tied into that entire ecosystem. And two, love or hate whatever the product, or company, et cetera, did, some of the most empathetic people I've ever met were closely tied to Chef’s orbit. So, I have not publicly commented until now on Chef getting acquired, just because I'm trying to give the people who are in that space time to, I guess, grieve—I don't know if grieve is the right word. But it's important to me that I don't say a whole lot there, because it's very easy for me to say something that comes across as crass, or not well thought out, or unintentionally insulting to a lot of very good people. So, I'm sitting here on the sidelines watching it and more or less sitting this one out, but it's deeply affected enough people that I wanted to talk about it here.



Pete: Yeah. And I'm glad that we are taking this opportunity to talk about it a bit. I had a lot of thoughts and feels. I tried to write a blog post about this to get them down somewhere, and a couple of paragraphs into it, I just really couldn't… it just seemed like a meandering, random mess of words without any real destination. But a few people online have mentioned this, and I'll definitely call it out as well: Chef was a tool. It was a tool like any other. You either loved it or you hated it; if you hated it, you probably really loved Ansible, or you really loved Puppet. It had a really, kind of, Vim-versus-Emacs feel to it, where you were either all in on it or not.



But the thing that Chef really brought for me is not only leveling up my career, in a way that I would not be where I'm at today if it wasn't for that tool and that community, but just how genuine everyone was within that community: the interactions we had at conferences, at Chef conferences, DevOps conferences, and things of that nature, and even the continued conversations online, back before Slack, which is hard to even remember. When we all were on IRC, we were in the Chef IRC channel, and it was a fantastic channel with a ton of people who would dive in and help you out with your Chef problems. But I think the biggest thing that tools like Puppet and Chef did for a lot of struggling sysadmins like myself is they got you into programming in a really safe, easy way. When you wrote a Chef cookbook, you were actually writing Ruby. It was its own custom DSL for Ruby, but what was great about it is that you could translate what you were writing in Chef, and how Chef operated, into other tools. And I really attribute a lot of those early tools that came out—Sensu was written in Ruby, Vagrant was written in Ruby; I mean, all of those early DevOps tools—again, before Golang—were written in Ruby, and I wonder if it's because of tools like Chef and Puppet, which were also in that language, that it was now very approachable for people to build in.



Corey: I would argue at this point that if I take a look now, through the lens of the last 15 years or so I've spent in this space, with the benefit of that hindsight, Chef was clearly head and shoulders above its competitors in a lot of different ways. I feel like their acquisition, more or less—if we view it as a loss, which it appears that a lot of people do, so I'm going to defer to the wisdom of crowds there—is fundamentally because the industry as a whole pivoted away from running these long-lived instances that need to have their configuration updated on a consistent, ongoing basis. Containers and stateless architecture have, more or less, eaten the world. The stateful things like databases are increasingly being handled as managed services, and all these companies are open-source projects, then, with enterprise offerings. And okay, great. We have to manage our database servers for those exception cases with Chef. That's great. A lot of those shops find that the open-source version of these products is more than sufficient for their needs because, let's face it, the enterprise versions are not inexpensive.



Pete: Exactly. I think there'll be a lot of armchair quarterbacking on why Chef didn't succeed like it should have. I will also armchair quarterback and tell you that my thought on it—and this is not an original thought in any way; I have very few original thoughts—is that open source is not a business model. And one thing that I—I think I put this on Twitter last year at some point when Docker had a similar end—actually, Chef had a more successful end than Docker—but Docker raised obscene amounts of money, hundreds of millions of dollars, and they were sold for, like, assets, right? They basically were just dissolved.



What was most shocking about Docker essentially going away—I mean, obviously, the open-source project will be around forever—is that Docker was the thing that really ate Chef's lunch. If you didn't need to really care about the server anymore because you were just deploying your applications within these containers, then you didn't actually need Chef anymore. So, Docker was really successful at destroying Chef's—and arguably Puppet's—business without ever actually figuring out a way to make money itself. And to me, that is one of the most spectacular things ever.



Corey: This episode is sponsored in part by our good friends over at ChaosSearch, which is a fully managed log analytics platform that leverages your S3 buckets as a data store with no further data movement required. If you're looking to process multiple terabytes or even petabytes of log data a day, or just a few hundred gigabytes, this is still economical and worth looking into. You don't have to manage Elasticsearch yourself. If your ELK stack is falling over, take a look at using ChaosSearch for log analytics. Now, if you do a direct cost comparison, you're going to save 70 to 80 percent on the infrastructure costs, which does not include the actual expense of paying infrastructure people to mess around with running Elasticsearch themselves. You can take it from me or you can take it from many of their happy customers, but visit chaossearch.io today to learn more.



Corey: I would absolutely agree with that. I think that the entire industry shifted in a terrific way that makes things a lot more accessible. You don't need to be a sysadmin anymore to build something intelligent and workable, and for better or worse, in the way that modern architectural practices have evolved, there just isn't space for a configuration management company in the several billion-dollar valuation range.



Pete: Exactly. And you look at the history of those companies: Ansible sold to Red Hat pretty early in its life. To be honest, I'm not sure what happened to Salt; you might know better than me, since I was never really that close to that community. Chef and Puppet just hung around, and they had real businesses.



The documentation from Progress, which was the acquirer of Chef, claimed that Chef was doing about $70 million of recurring revenue. That's no slouch; that is good revenue that a lot of companies would love to see. The interesting part of the acquisition is that they were only sold for $220-ish million. That's not a big multiple on top of a $70 million run rate, and it's only about twice what investors had put in.



This is something I've written about in the past, around how startups are funded. When investors put money in for their preferred shares, there's usually a concept of participating preferred, which I'm not going to explain in any detail at all; I highly recommend reading smarter people than me online who talk about it. But at a really, really basic level, it allows those VCs to take the money they put in off the top before splitting the rest of the pie. So, if they had put $100 million in and all of those VCs take out their portion first, now, out of a $200 million valuation, you only have $100 million left to split. Of course, you'll be splitting with those same VCs who already took their money out first, as well as everyone else who has shares. It's something that you'll see in pretty much every startup's capitalization table, but it causes common shareholders, like the employees, many of whom had been there five to ten years, to see a lot less for their shares.



And if you were an employee of Chef in the last few years, my guess is—just because of the nature of VC funding—that the price per share investors paid in the last round was probably a lot higher than the exit price. But due to this participating preferred model, they're able to at least not lose their money. Maybe they got a 2x return because they got their money off the top plus the split. But again, as outsiders, it's hard for us to know exactly who won on this one in the grand scheme of things.
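
To make the mechanics concrete, here's a deliberately simplified worked example with made-up numbers; real capitalization tables are far messier than this.

# Participating preferred, simplified: the VCs get their money back
# off the top, then also share in what's left.
invested = 100_000_000    # the VCs' preference (what they put in)
exit_price = 220_000_000  # the acquisition price
vc_ownership = 0.60       # hypothetical VC share of the remaining split

remaining = exit_price - invested  # $120M left after the preference
vc_total = invested + remaining * vc_ownership
common_total = remaining * (1 - vc_ownership)

print(f"VCs receive:         ${vc_total:,.0f}")      # $172,000,000
print(f"Common shareholders: ${common_total:,.0f}")  # $48,000,000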



Corey: There are too many stories in the world of employees who wind up not getting any form of value out of the exit they go through. There are so many different things that can happen. Finance is complicated, to be direct. I would argue most engineers are not sophisticated in the realm of corporate finance.



And that's okay; I'm not saying that they should be. There are a lot of problems in that space, and I've always viewed equity with a healthy grain of salt because I don't consider myself any different. When I was in the employment stage of my career, I was absolutely not possessed of a sophisticated understanding of these financial aspects. And even if I were, the investors on the other side of the table have the edge: I'm doing one of those deals maybe once every two or three years; they're doing three or four a week. They're always going to be better at this than I am because that's what they do. And that's okay.



Pete: Exactly. For every Datadog, for every Elastic, for every one of those companies, those startups that raised some money and IPO'd for billions of dollars, there are hundreds of Chefs; there are hundreds of companies that make it to this danger level where they raise a lot of money, but their revenue is just not growing because the market has shifted so fast. You can either raise under worse terms, or you can try to take an air-quote “exit” to just be done with it, to be able to get something for it. And then, for all of those Chefs, those companies in the middle, there are thousands of startups that don't even get to the level of a Puppet or a Chef.



And I know that; I've worked for pretty much all of them at this point. You get to the danger zone: you raise a Series B, a Series C; now you've raised 50 million, 100 million dollars. You need to get to half a billion or a billion dollars a year in revenue; the numbers all just get so high. It's just the downside of VC: the multiples get higher and higher, and harder and harder to hit, and thus maybe companies make desperate decisions or short-sighted decisions that cause them long-term pain. Again, this is not easy at all, and I don't claim to be an expert on it. These are hard, hard problems.



Corey: Yeah, there are a lot of, again, very good people who care a lot—they're incredibly intelligent—who have been working both at the company and, of course—we can't sell this one short—in the larger Chef community. I really think that there's a tremendous amount of value there. Now, the company that acquired them—which I don't believe we've talked about yet—is Progress, and their tagline may as well be, “Who are you again?” They apparently have a bunch of enterprise CI/CD-style software, but they're certainly not a commonly known name in the space, at least to me.



Pete: They're in Boston, apparently. I'm in Boston. I have never heard of them before. I actually had to go on LinkedIn to see if I had any first-level connections, and was shocked to find I had zero.



Corey: “Oh, they’re just in Boston for tax purposes.” “Well, doesn't that sound incredibly expensive?” “Well, they're in Boston for tax purposes that make no sense.” “What up?” “We didn't say they were good at it.”



Pete: [laughs]. Exactly. I mean, that's probably not exactly why you'd be going there, for sure. But my hope, obviously, is that the employees who are there get taken care of. I know a lot of friends who were there for a while and have left and moved on to newer places, and I still have a lot of close friends there.



And with an acquisition like this, sure, you're probably not making any reasonable money from your options, but there is the great secret of an acquisition: they don't want to lose you right away. If you're an engineer or an operator there, you're actually an asset to the acquiring business, and in many cases, they could put a pretty lucrative retention bonus in front of you to stick around for one year, two years, something like that. And in a lot of ways, that's where you're going to make the money on an acquisition, for good or bad: the new company really wants you to stay, and they'll make it in your best financial interest to do so. But at the end of the day, I think it's sad how the ending happened. It was inevitable given how fast the market moved, just like it was inevitable that Docker was not going to go anywhere, because this market moves so fast.



But the one thing that really rings true—and I don't think another company has done it as well, yet—is that Chef built such a strong community around technical operators. The Chef community was very tight-knit, and it was very big. I look at it as a huge part of my career, of leveling up my career, by being able to meet just such an amazing group of people, such a wide, diverse group of operators, and engineers, and people who cared about this. And these are lifelong friends for me. It's really amazing. It's really the best part of Chef, from my perspective.



Corey: I would agree with that. It seemed, at some level, it was trying to pivot and find its way. InSpec was great: the idea of compliance as code, which is a slightly different thing from configuration management as code, though they obviously have clear antecedents in common. The value, though, is that suddenly you can prove things about your environment to auditors. I loved the idea, but my entire impression of the space is, and remains, that it's effectively a sysadmin/SRE dead end with a 40-year-long tail of decline ahead of it, as a space. That may be unfair, but it definitely matches my understanding of the world.



Pete: Yeah, and it's not like Chef is going away anytime soon. It's an incredibly useful tool. There are a lot of companies that honestly are not using it that should be. There are a lot of enterprises out there—remember, if you go to a devopsdays event, the people there are such a small fraction of overall business. There are tons of people that still haven't done anything with configuration management, let alone Docker; they're so far away from that level of maturity.



So, Chef is definitely not going away, and it's not like people are going to suddenly rip it out; and it's still open-source, depending on what they do with it. There's a whole licensing change which, obviously, we only have so much time to get into, along with all of the other interesting changes over time, but it's an open-source project; there's still a community around it. There are still people who really like the project, and I kind of base my retirement on the hope that, 20 years from now, there's some company with an antiquated Chef install—like today you have some antiquated COBOL mainframe—where I can live out my retirement working 10 or 20 hours a week fixing someone's legacy Chef install. That's really my hope for the future.



Corey: I think that's probably the best place to leave it. I hope for the sake of people I know, trust, and like, that things go well. I would love nothing better than to be completely wrong in everything that I just said because I want to be.



Pete, thanks once again for joining me. This is the AWS Morning Brief. If you've enjoyed this podcast, please go to lastweekinaws.com/review and give it a five-star review on your podcast platform of choice, whereas if you hated this podcast, please go to lastweekinaws.com/review and give it a five-star rating on your podcast platform of choice, and tell me in the comments why I'm a complete idiot about Ruby.



Announcer: This has been a HumblePod production. Stay humble.

Fri, 18 Sep 2020 03:00:00 -0700
Is the AWS Free Tier Really Free? (AMB Extras)

Links Mentioned



Sponsors



Never miss an episode



Help the show



What's Corey up to?

Wed, 16 Sep 2020 03:00:00 -0700
Going Flat Out Like A Koala In Season
AWS Morning Brief for the week of September 14th, 2020.
Mon, 14 Sep 2020 03:00:00 -0700
Pulling Back the Curtain on Palantir (Whiteboard Confessional)

About Corey Quinn

Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Links



Transcript

Corey: This episode is brought to you by Trend Micro Cloud One™. A security services platform for organizations building in the Cloud. I know you're thinking that that's a mouthful because it is, but what's easier to say? “I'm glad we have Trend Micro Cloud One™, a security services platform for organizations building in the Cloud,” or, “Hey, bad news. It's going to be a few more weeks. I kind of forgot about that security thing.” I thought so. Trend Micro Cloud One™ is an automated, flexible all-in-one solution that protects your workflows and containers with cloud-native security. Identify and resolve security issues earlier in the pipeline, and access your cloud environments sooner, with full visibility, so you can get back to what you do best, which is generally building great applications. Discover Trend Micro Cloud One™ a security services platform for organizations building in the Cloud. Whew. At trendmicro.com/screaming.



Corey: Welcome to the AWS Morning Brief. I'm Cloud Economist Corey Quinn, and this is the Whiteboard Confessional segment that has increasingly been taken over by my colleague, Pete Cheslock, as we tear through various cloud-based companies' public filings as they race to go public and inflict their profits, or in more cases, losses, on the public markets. Pete, thanks for joining me.



Pete: I am super happy to be back again, and making my mother happy that I'm actually using that MBA that I spent all that time to get.



Corey: So, we could wind up talking just about how Palantir is awful in a variety of ways. My personal favorite was the letter that their CEO attached saying that effectively engineers were stupid and didn't see the big picture, which is a weird thing to say about a whole group of people you're actively trying to hire, but all right. Let's talk about their S-1 filing. This has been anticipated for a while. What do you think?



Pete: Well, Palantir has been around for a very long time. I think it's been around a lot longer than a lot of people realize. You know, early 2000s. It was technology built to tie data together and, to be honest, I've only ever heard of one commercial company actually using Palantir—the technology. They were actually using it as a SIM—SIEM, whatever you want to call it—Security Information Management system—



Corey: Event management or something like that. Yeah.



Pete: Exactly. And ironically enough, that company actually—that was using Palantir—replaced it with an Elasticsearch ELK stack, which I thought was fascinating. I know nothing about their software, but I was very fascinated to read the S-1 because there's been this mythology around it and you can hear so much about insiders at Palantir, employees selling their shares in this wide secondary market. So, I was very curious to see what we were going to find, and there are definitely some interesting bits within.



Corey: There certainly are. And it's strange because for a while Palantir was doing interesting things in the market. They were offering $20,000 referral bonuses to people who referred engineers in for certain roles, and you didn't have to be a Palantir employee to do it, which was fascinating. They've recently moved headquarters from Palo Alto over to Denver, Colorado, which… okay. They are claiming it's for this whole lofty mission. Let's not kid ourselves: it's a tax play. [laughs].



And there's also a whole bunch of interesting stuff buried in here. But yeah, in many ways, this is a legacy company in some respects. It's been around almost 20 years. And strangely, I don't know about you, but I don't know anyone who works for Palantir. I did a little digging in preparation for this episode, and it turns out, I actually kind of do, but they're very quiet about it. It's one of those things where people don't want to be called out for working at a company that is this particular flavor of controversy, and I can't say I blame them.



Pete: Yeah, I haven't looked through my LinkedIn to see if any of my connections have ever worked there. Granted, it's such a West Coast company that for me out on the East Coast, it'd be pretty rare to run into anyone out here who's taken their time and done the Palantir thing. I have heard, again, the rumors that they've always paid very well, and—



Corey: They would kind of have to.



Pete: You know, in the Bay Area, you kind of have to. You're competing for talent against other places that pay really well, like Netflix, and Uber, and all these other big companies that are out there. So, it's a big competition for the top talent.



Corey: Oh, yeah. And most of what they do is data analytics. They take in a whole bunch of data, and they crunch a whole bunch of numbers and come out with other stuff. Historically, they have been focused on selling their services to governments, but now they're expanding in the enterprise story as well. And that is, of course, going to be a bit of a challenge for them as they expand into it, but we can talk about what they do, how they do it, and all the other challenges. Let's talk about Cloud. What do we know about their cloud environment based upon their public filing?



Pete: Well, they talk about their commitments. So, this is something you often see in S-1s—their various cloud commitments—and I think this one was super interesting in that they listed about $1.5 billion in cloud commitments over six years, and this was an agreement they entered into at the end of last year. Just a massive, massive amount of cloud spend commitment, right?



Corey: Yeah, it’s a quarter billion dollars a year in spend. And again, we see a number of customers in that range pretty frequently; it's just not typical to see the better part of a decade given to satisfy those commitments. Usually it's, “Well, this stuff is always changing. Let's talk about doing this for the next three years.” Six is a bit on the outside range of what we tend to see.



What's fun to me was the breakdown of that commitment, which was just—I've been using this as a talking point for a week now—which is that they have two undisclosed cloud companies in this part. They mention elsewhere that they use Azure and that they use AWS. Great. Fine. For one cloud provider, they have a six-year commitment of $1.49 billion, which is an enormous pile of money. The other provider they have is far less than that: $45 million committed over a five-year span. That's well over an order of magnitude of difference.



People ask me from time to time, “Well, why do you focus on AWS and not other cloud providers?” This is the purest example I've got to prove the point. When we talk to companies about their expensive cloud bills, the AWS problems add a zero to the end of what they tend to be with other cloud providers. It's not out of preference on our part, and it's not out of, “Well, this is what we know and learning other things is hard.” It's that when we hear complaints about GCP or Azure bills, they are not in the same universe as what we're seeing on the AWS side. In other words, their annual Azure commit is about $9 million a year, and their AWS commit—at Palantir—is $250 million a year. Bit of a difference.
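
For anyone who wants to check the math, here's a quick back-of-the-envelope sketch in Python, using only the round figures quoted above (the helper function is ours, not anything from the filing):

```python
# Annualizing the two cloud commitments quoted from Palantir's S-1.
def annualized(total_commit_dollars, years):
    """Average yearly spend implied by a multi-year commitment."""
    return total_commit_dollars / years

aws_per_year = annualized(1.49e9, 6)   # ~$248M/year on the big provider
azure_per_year = annualized(45e6, 5)   # $9M/year on the smaller one

print(f"Primary provider: ${aws_per_year / 1e6:,.0f}M/year")
print(f"Secondary provider: ${azure_per_year / 1e6:,.0f}M/year")
print(f"Ratio: {aws_per_year / azure_per_year:.0f}x")  # roughly 28x apart
```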



Pete: Yeah, exactly. I think outside of the fact that the commitment was so high, they obviously had some really interesting contractual terms, which kind of talked about how they could shift their spend if they didn't satisfy it by a certain time, and things of that nature. But what was so amazing was that they talked about this $1.49 billion commitment, and then this other sentence was like, “Oh, yeah, by the way, we also have this $45 million one, and there's some stuff on that one as well.” So, of course, I'm like, “All right, it's either Azure or GCP. Which one is it?” So, I just start searching through, and they list Azure, they list Amazon, they just don't list them in the same sentence. But as rational viewers, we can make the assumption that they're probably not doing a $1.49 billion commitment on Azure.



Corey: And we can tell that because if some company did do a $1.49 billion commit on Azure, you wouldn't be able to roll out of bed in the morning before a sponsored ad on your alarm clock was telling you about it, thanks to Azure’s marketing campaign. They would be insufferable about this and never stop talking about it. Whereas with AWS, it’s one of those, “Okay, cool,” things. That's a big commitment, don't get me wrong. And they're one of the larger AWS accounts, depending on which categories you look at, with a $250 million a year commit. But remember that AWS is north of $40 billion a year in terms of revenue. So, it's not a story where you just have one or two big customers that drive everything else.



Pete: Yeah, exactly. I mean, think of a $250 million yearly swing for Amazon if Palantir didn't meet their commitment—and they have some words in here that basically say they can push it forward. I think the minimum they would have to cut a check for is something like $30 or $50 million, and they can push the rest of the spend later. But even so, even though it’s such a huge number, it's such a small percentage of the behemoth that is Amazon and their cloud. It really is.



Corey: Yeah, there's a lot that winds up factoring in. And I think it's easy to sit here and cast stones, but these are big numbers, and it's a heck of a commitment. And it's just one of those things that I find fascinating. Especially since—let's not kid ourselves—Palantir is not exactly a logo everyone wants to have on their website. There are serious challenges with that, and, “Huh, we got a lot of attention for talking about doing business with Palantir, but it's not the good kind of attention.” To be very clear, because I don't want to get letters, Palantir is not a customer of The Duckbill Group.



Pete: Yeah. I mean, they're known as a very secretive company, right? And I don't think you're going to see their advertisements on billboards like you might see—at the airport, you might see a Slack [laughs] commercial up there. So, it's a lot different company. One of the places where I think it shows just how different this company is, is how they are actually doing this IPO listing. So, if anyone has been following, some other companies have done what's called a direct listing: Slack was one, Spotify was another. This is where—



Corey: I believe Google did something similarly unconventional with its Dutch auction, too.



Pete: Yeah, you just—you sell your shares to the public. So, it's for companies that have a certain hype factor about them or, honestly, for companies who think that they can pull it off. Whereas in a normal IPO you're going to have underwriters—these big banks whose names you've all heard, like Goldman Sachs and whatnot—and they will essentially work with their customers to set a price for your shares and to make sure that the initial batch—let's say you sell 100 million shares—make sure that those shares are sold on the IPO. And then there's a lockup period, and that's where everyone else who's holding shares just has to sit and wait and hope the company doesn't just crater like in the last bubble we dealt with. Now, a direct listing is much different, and there's a lot of risk to it as well.



A direct listing is anyone who has shares—any registered shareholder, which is like an employee, a former employee, the hundreds or thousands of people who hold shares in Palantir—can just sell them. There'll be someone in the stock market called a market maker who will facilitate those trades, but when they list, if you have shares in Palantir, you can sell them for whatever someone's willing to pay for them. It's a much different model. But what was super fascinating is that when you do that, you could have a lot of people sell their shares, and then you could theoretically lose control.



Palantir did something extremely fascinating, very much like how Facebook did it, where they've actually created these multiple classes of shares. Class A shares, which are what you're selling, just have a single vote, but Class B shares—of which Peter Thiel actually has the largest block—have 10 times the number of votes. But then they created this new type of share—I have never heard of this before; I’m curious if another company has done this—called a Class F share. And what this does is give the three founders 49.999999% of the total voting power. Did I get all the nines in there? Yeah. Six nines.



Corey: In other words, the entire world has to unify against them—



Pete: Exactly.



Corey: —everyone participating in order to overrule them, which is, frankly, not happening.



Pete: And what it means is that anyone can sell as many shares as they want, because a direct listing means that from the IPO, anyone can sell any actual shares in the company. And they can do that without the three founders ever being overrun because, like you said, you'd have to get everyone else on the other side of it. So, the founders have to be included somehow. That was just super fascinating, but it's how they're going to maintain control while allowing people to just sell their shares.
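
To make the arithmetic behind that concrete, here's a toy sketch of why pegging the founders at just under 50 percent works; the function and scenarios are ours for illustration, not anything from the filing:

```python
# Why 49.999999% is effectively control: alone, the founders fall a hair
# short of a majority, but the tiniest ally among the remaining votes
# tips any ballot their way. Everyone else must vote in perfect unison
# to stop them.
FOUNDER_VOTING_POWER = 0.49999999  # the Class F figure, per the S-1

def founders_carry_vote(ally_fraction_of_remaining):
    """True if the founders win when this fraction of other votes joins them."""
    votes_for = (FOUNDER_VOTING_POWER
                 + (1 - FOUNDER_VOTING_POWER) * ally_fraction_of_remaining)
    return votes_for > 0.5

print(founders_carry_vote(0.0))        # False: alone, they just miss 50%
print(founders_carry_vote(0.0000002))  # True: a sliver of support suffices
```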



Corey: This episode is sponsored in part by ChaosSearch. Now their name isn’t in all caps, so they’re definitely worth talking to. What is ChaosSearch? A scalable log analysis service that lets you add new workloads in minutes, not days or weeks. Click. Boom. Done. ChaosSearch is for you if you’re trying to get a handle on processing multiple terabytes, or more, of log and event data per day, at a disruptive price. One more thing, for those of you that have been down this path of disappointment before, ChaosSearch is a fully managed solution that isn’t playing marketing games when they say “fully managed.” The data lives within your S3 buckets, and that’s really all you have to care about. No managing of servers, but also no data movement. Check them out at chaossearch.io and tell them Corey sent you. Watch for the wince when you say my name. That’s chaossearch.io.



Corey: So, changing gears slightly: is this company making money? I mean, they have enough of a marketing problem; there's a clear question of, if people do business with them, they're not going to be clamoring to do a testimonial; and anything to do with government, that's just a given, regardless. So, are they making enormous piles of money for the trade-offs that they have agreed to make?



Pete: Well, there's a lot of people online that are really happy to go and crap all over how much money this company loses, and I'm happy to be yet another one of those people that's going to crap on how much money they lose every year. Their revenue is growing. They're doing, I think, over a billion dollars a year in revenue; that's huge. But they're also burning half a billion dollars a year. $500 million a year burned—



Corey: On top of that revenue. So, they're basically making a billion dollars a year and losing half a billion dollars a year because they're spending a billion and a half.



Pete: Exactly. And in some cases, their margins are thin, their cloud spend is high, and people may not renew. But again, what's interesting about this one is that they talked about their average deal sizes, and these are measured in the millions of dollars. So, they're not selling like Datadog: “Oh, I got my five agents and they cost me 100 bucks.”



They're selling seven-figure deals to large organizations: the biggest enterprises, the military, government, etc. These are really large deals. And having been in that space, having had to sell to the government and to big companies, those deals take years to complete. So, it's kind of no wonder that it takes them so long and that they burn so much cash to do it.



Corey: One question that comes up is, what are the alternatives to using something like Palantir? I don't know enough about what they do and how, but there have got to be other options, because again, this feels like it's relatively old tech. There has to be something relatively new that would come out and solve this problem in varying ways. I have no idea what those might be, but I can't accept that this company is just sitting around with zero competition, not while only making a billion a year.



Pete: Yeah, and even in their S-1, they don't actually list direct competitors. They speak more abstractly—that they are this big data play type of thing, and again: secretive company. It's hard to know from the outside what they're doing. It kind of feels like they're going into these places with their top-tier engineering talent, with a core product that has the ability to make inferences between different data sets and give people insight into something they didn't have before. My guess is that they're maybe not building custom solutions for everyone, but I bet there's a good amount of custom development work that they're probably doing when they go in and sell one of these seven-figure deals, because you have to remember, if you're in a sales cycle for one to two years, you're probably building some custom stuff based on what the customers are asking for.



Maybe those features then tie into what someone else wants to buy as well. They did talk about the different platforms they have, but they really didn't talk at any length in their S-1 about what the thing is they're offering, other than a way of tying data together and making inferences into this data from different data sources. I almost wonder if their secretive nature actually is their marketing. Like, that is what gives them the hype and the aura: how secretive they are.



Corey: And that's going to be an interesting challenge. One other area that I found fascinating to look at was something you highlighted, namely that they've committed to $250 million a year to their primary cloud provider—which, let's not kid ourselves, is AWS—but they've only used about $40 million of that in six months, so that's an $80 million a year run rate. Someone either over-committed massively, or they plan to have a massive growth hockey stick, which is a bit ambitious if I'm negotiating the commits. Is there any other way to read that?



Pete: That was probably the most shocking—again, there was a lot of shocking things in this S-1. Not really shocking, but things that were like, “Whoa, that's weird.” This one, from our world, we do so many negotiations on EDP commitments, seeing something like this where you've committed to essentially $250 million a year, but across six months, you've only met $40 million of that, if I was the operator there, the person who negotiated this, I might be raising my hand and be like, “Well, well, hold on a second. What just happened here?”



And it means one of two things. Either they massively overestimated how much spend they were going to need, or there's the potential that they actually have some really large deals on the way. And I have no idea if this is the case, but around—what was it—last year or whatever, Amazon and Microsoft had been battling over this JEDI contract, which is modernizing the government to use Cloud. Well, here's Palantir, who has these agreements with two cloud providers, and they sell heavily to the government.



So, is this essentially them saying, “Yeah, we're expecting more of the government moving to Cloud?” Because they clearly want that. They want to be in more Cloud. This type of deal shows that they want to be in Cloud, and this is potentially their big hedge on it. There was one line that I did think was interesting: it didn't look like they had a big downside if they didn't hit the commitment. It basically said that if the difference between what they committed to per year and what they hit was off, something like $30 to $50 million was essentially what they would have to pay, and the rest would fall into a future year to use. So, I mean, I'm sure they don't want to pay money they don't have to, but it seems like their risk maybe isn't a use-it-or-lose-it model.
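
The run-rate math the two of them are working through looks something like this; a rough sketch using the round numbers from the conversation (the true-up terms are paraphrased, not the contract language):

```python
# Palantir's usage pace versus the primary-cloud commitment discussed above.
committed_per_year = 250e6   # ~$250M/year commitment
used_first_half = 40e6       # ~$40M actually spent in six months

run_rate = used_first_half * 2                 # annualized: $80M/year
shortfall = committed_per_year - run_rate      # $170M/year gap

print(f"Run rate: ${run_rate / 1e6:.0f}M/year")
print(f"Unspent commit: ${shortfall / 1e6:.0f}M/year "
      f"({shortfall / committed_per_year:.0%} of the commitment)")
# At that pace, the check they'd cut is the rough $30-50M true-up cited
# above, with the rest of the committed spend rolling into future years.
```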



Corey: Yeah, but there are ways that winds up being structured. In some cases, you're right. I mean, it all depends on exactly what was negotiated at the termination point. I do know that when you start committing to certain yearly spends, they start forecasting and counting that revenue, so, “Hee hee, if we do enough of these where things don't hit, and we just let people pay it off whenever they want, now we have to restate earnings.” And that's something Amazon is disinclined to do.



Pete: Right. And I wonder, too, if you're a few years into this agreement and you need to renegotiate things, at this kind of spend I feel like you have a lot of flexibility especially to—



Corey: We’ve seen this publicly with filings from both Pinterest and Snap. Snap committed to, I think, $2 billion with GCP, and then within the next six months, another billion on AWS, and, well, maybe committing to $3 billion of cloud services over the next five years isn’t… the most responsible way of playing this. And sure enough, they extended it out another few years and added some token amount on top of it just to give them more time to burn through it. We saw the same thing, to a lesser extent commitment-wise, with Pinterest.



Pete: Exactly. I think that—



Corey: All through their public filings, to be clear.



Pete: Right, right. And I think that is what is fascinating is that this commitment exists; we can all read it; it's all public information, but anything can change at any time. It all depends on who's got the leverage in the negotiation. And if I was running a lot of things in a cloud provider and I had a way to run them in another cloud provider, even for a little while just to show that I had that capability, that can be a negotiating point. But you could also say to that, like, you could always renegotiate and just extend your end game out later. I think everything's up for negotiation when it comes to these EDPs. When it comes to these enterprise agreements, there's a lot of flexibility.



Corey: Yeah. I think we're going to see what happens because the nice thing about companies going public is they have to update us on these things every quarter, whether they want to or not. It's weird to me to see Palantir going public because for so long, they've taken the official position that much of what they do would not be well served by required transparency, which is always the sign that someone's up to wonderful things. But, well, if you want the money, you've got to go ahead and do the filing pieces. So, good luck.



Pete: I think it will be interesting to see their filings over the next few years, and I'm sure there's going to be a lot more people out there much smarter than us with their own thoughts and opinions on what they're doing and how they're doing it because it's all going to be pretty open to the public eye.



Corey: Yeah. Pete, thank you so much for taking the time to go through yet another one of these. We'll attempt to go in a different direction next week, but given that everything is going public, who the hell knows.



Pete: Thanks so much for having me again. It's always a blast.



Corey: [laughs]. This is the AWS Morning Brief. I'm Cloud Economist Corey Quinn, and if you've enjoyed this podcast, please leave flattering comments about me with your five-star review, whereas if you hated it, please insult Pete along with your five-star review.



Announcer: This has been a HumblePod production. Stay humble.

Fri, 11 Sep 2020 03:00:00 -0700
Dipping my Toes into Digital Ocean (AMB Extras)

Links Mentioned



Sponsors



Never miss an episode



Help the show



What's Corey up to?

Wed, 09 Sep 2020 03:00:00 -0700
Amazon Repeatedly Stomps on Own Schmeckel
AWS Morning Brief for the week of September 7, 2020.
Mon, 07 Sep 2020 03:00:00 -0700
SnowflakeDB’s S-1: The Fine Print (Whiteboard Confessional)

About Corey Quinn

Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Links



Transcript



Corey: This episode is brought to you by Trend Micro Cloud One™. A security services platform for organizations building in the Cloud. I know you're thinking that that's a mouthful because it is, but what's easier to say? “I'm glad we have Trend Micro Cloud One™, a security services platform for organizations building in the Cloud,” or, “Hey, bad news. It's going to be a few more weeks. I kind of forgot about that security thing.” I thought so. Trend Micro Cloud One™ is an automated, flexible all-in-one solution that protects your workflows and containers with cloud-native security. Identify and resolve security issues earlier in the pipeline, and access your cloud environments sooner, with full visibility, so you can get back to what you do best, which is generally building great applications. Discover Trend Micro Cloud One™ a security services platform for organizations building in the Cloud. Whew. At trendmicro.com/screaming.



Corey: Welcome to the AWS Morning Brief: Whiteboard Confessional series, where I am joined once again by my colleague, Pete Cheslock. Pete, thanks for taking the time to tolerate my slings, arrows, and other various forms of cynicism.



Pete: You know, I didn't take that much offense to the fact that I have an MBA, so I decided to come back and see if we can make use of that investment.



Corey: So, fun story. The last one of these that we did was talking about—whose S-1 was that? They're starting to run together at this point.



Pete: That was Sumo Logic’s.



Corey: That's right. And it was, “Oh, let’s talk about what they're doing.” And then throughout the day, I think five tech companies all filed to go public, which is just bizarre. So, we're going to take a couple more episodes to slice and dice a couple more that were of interest to us.



Pete: Yeah, absolutely. We're going to chat about one that I had honestly been waiting for because of the hype and the myths around this company. It's a big data company called Snowflake.



Corey: They're very special and unique.



Pete: They're very special. I often listen to CNBC in the background—it's kind of interesting to pick up little bits, and sometimes tech pops up in a CNBC broadcast. When Snowflake filed, I think one of the announcers said something to the effect of, “I don't know why you'd want to be called Snowflake.” [laughs]. So, I had a good chuckle at that one.



Corey: Because they've been around longer than that's been a disparaging term used by jerks.



Pete: [laughs]. Exactly, exactly. So, they filed their S-1 in that flurry with a whole slew of other companies—and we will definitely get to at least one more of those—and honestly, I've never worked at this company. I did at one point go through a sales process with them, so I can share some of my thoughts and opinions there, but the reason why I was so excited to see this one is because of the sheer amount of VC money that this company has raised: well over—I don't know about ‘well over,’ but definitely over—a billion dollars of VC funding. It's crazy.



Corey: My comment at one of the big tech conferences last year—back when that was a thing we went to—was I was walking around their booth, and I noticed that they had this mock-up of a race car suspended in the air. And then I realized, “Oh, my God, that isn't a mock-up.” Which told me at that point that if you're paying retail pricing for Snowflake, you're probably doing something very wrong.



Pete: Yeah, absolutely. I think to dive into one of my favorite Snowflake stories: at a previous company, we were checking Snowflake out—we got connected with them via some connections our head of product had, and some success we'd heard Snowflake was having with helping run, you know, a data warehouse. That's what it is: it's a data warehouse technology. If you're in the Amazon ecosystem, you might be using Redshift. Snowflake can do some of those things, and it can do some other things.



Corey: Why would I use something like Snowflake instead of Redshift? I mean, for starters, taking the naive approach: okay, this is in a different Amazon account, so at minimum, I'm going to be paying data transfer in and out on both sides. But again, we're talking data warehousing; the data transfer is usually something of a rounding error compared to all the extra cost that goes into that.



Pete: And this is where I think a lot of their growth in the early days came from: the deficiencies in Redshift. In the technology, in the investment that Amazon was doing there—Snowflake could do a lot of things just simply better. I think, additionally, they were probably taking a lot of business from Oracle shops and things of that nature. But I do know a friend of mine—at his company, they had well over a million dollars a month in Redshift spend, and they actually moved over to Snowflake as a cost-savings initiative. It was significantly cheaper.



But what’s, I think, so fascinating—when I heard that, I was like, “Well, hold on a second. You know, Snowflake runs inside of Amazon.” So, I'm always curious how that relationship works with Amazon, where you've got some account manager who's going to lose some big chunk of an Amazon customer's spend by their Redshift spend going down dramatically, but then whoever the account manager for Snowflake is must just be super excited by that because obviously their spend is going to go up.



Corey: Yeah, on some level, if you're running a data warehouse on top of AWS, from the high-level AWS perspective, well, is it spend that’s going to happen on your account, or is it spend that’s going to happen on Snowflake's account? It's not likely that you're going to be building everything on top of AWS and then Snowflake is going to be running its stuff on another provider. The data transfer charges there become exceedingly non-trivial.



Pete: Yeah, absolutely. One of the things that is interesting about how Snowflake works, at least from my recollection a few years ago, is that you can stream your data into S3, which is very cost-effective. Snowflake can actually ingest your data from S3, and what they basically do is put it into their S3. And you pay the same S3 pricing. I remember the sales guy.



He was like, “Yep, it’s just pass through pricing. You pay what we pay.” But then I said to him, I was like, “Okay, cool. Well, how is the rest of Snowflake charged?” And he said, “Well, most people just buy credits because it's all consumption-based, right? The more you use, the more you spend.” I was like, “Oh, okay. Well, how does the credit work?”



Corey: So, nondeterministic is what I'm hearing there.



Pete: Yeah, like, “How is it priced? How many credits?” “Eh, most people just buy, like, 100,000 in credits and see where it takes them.” And I'm like, “Okay. So, I'm just going to cut you a $100,000 check, not knowing if that's going to last me one month, 12 months, 24 months?” It did not go well. [laughs].
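
For a sense of why that pitch feels nondeterministic, here's a rough sketch of how consumption-based credit math tends to work; the per-hour rates and credit price below are illustrative assumptions, not Snowflake's actual price list:

```python
# Back-of-the-envelope credit burn for a consumption-priced warehouse.
# Assumed (illustrative) rates: credits per hour double with each
# warehouse size step, and a credit costs a flat dollar amount.
CREDITS_PER_HOUR = {"XS": 1, "S": 2, "M": 4, "L": 8, "XL": 16}
DOLLARS_PER_CREDIT = 3.00  # hypothetical list price

def months_of_runway(prepaid_dollars, size, hours_per_day):
    """How long a prepaid credit block lasts at a steady usage pattern."""
    credits = prepaid_dollars / DOLLARS_PER_CREDIT
    burn_per_month = CREDITS_PER_HOUR[size] * hours_per_day * 30
    return credits / burn_per_month

# A $100,000 credit block on a Medium warehouse running 8 hours a day
# lasts about 35 months; run it around the clock and it's closer to 12.
print(f"{months_of_runway(100_000, 'M', 8):.0f} months")
print(f"{months_of_runway(100_000, 'M', 24):.0f} months")
```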



Corey: Just pour some money in, and when it runs out, well, pour some more money in.



Pete: Exactly. So, it felt a little weird from that perspective. We ended up not going with it for other reasons—more so that we needed something a little bit more real-time, and we definitely had a tradition at that company of building a lot of it ourselves.



Corey: It happens to the best of us.



Pete: You know, build or buy. Sometimes you just want to build stuff, I guess.



Corey: So, do we know what they're built on top of? Is it something that is disclosed? Is it their own special secret sauce? In other words, what makes a Snowflake snowy?



Pete: What is the special Snowflake behind the avalanche of data? I'm available for marketing, too; Snowflake needs some help. But the rumor that's been out there—I think it's been reported on, and it's hard to say what they're doing today—is that, at least in the past, FoundationDB was actually powering some of their metadata layer. FoundationDB was a closed-source database, and the only other company I was aware of that was using it for their SaaS product was a company called Wavefront, a monitoring company that VMware acquired. People I know who have used it have said extremely good things about it. What ended up happening to FoundationDB is that Apple actually acquired it, but then I believe they open-sourced it, so it might actually be an open-source product now. But it definitely was used in the past, and my assumption is that, as with most database choices, it's probably still the case today.



Corey: It's an interesting problem. I feel like on some level, the thing that probably annoys AWS the most is that they have their Snowmobile, and their Snowball, and their Snowcone, and Snowflake would fit in so perfectly, except there's a whole company there that frankly, does way better marketing than Amazon seems to. So, I feel like that is probably a sticking point over there.



Pete: Well, so I think that's a good segue into the fun bits of the S-1, which is: how did it look? You've got this company that has been claiming stratospheric growth, raising a billion dollars; you've got to have some sort of success to raise that kind of money. They had hired this legendary executive out of Microsoft, Bob Muglia, to be the CEO for a period of time, although he's no longer there. And diving into it, some of the high-level stuff that I saw was actually pretty impressive. Their revenue growth is still over 100 percent year over year, and they're doing hundreds of millions of dollars.



I mean, I think they are on track by next year to be doing half a billion dollars in top-line revenue. And that is growing at 120 percent or something, I think, from some of the things that I saw. It's pretty impressive when you can grow at that rate. It's very similar to what I was seeing with Datadog, and they had a pretty monster IPO—a big company with a big wind behind it. Much different fundraising, though. I'm pretty sure Datadog raised, like, what, maybe 100 million or less.



So, it's very interesting how companies are capitalized. But the other thing that I was really blown away by—though it should make sense—is their customer retention, called net dollar retention: it's over 100 percent. They have to spend a lot of money to bring these customers on, but once the customers are there, they expand their usage. And again, if you've ever used a SaaS service, you can probably understand how easy it is to grow usage over time. This seems to be a very sticky service, and that means a lot of the clients they bring on board are spending more by the next year, which is always a good sign.
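
For the curious, net dollar retention is usually computed something like this; a minimal sketch with hypothetical cohort numbers, not Snowflake's actual figures:

```python
# Net dollar retention (NDR): take the customers you had a year ago and
# ask what that same cohort spends today, ignoring anyone signed since.
def net_dollar_retention(start_arr, expansion, contraction, churn):
    """All inputs are annual recurring revenue for the same cohort."""
    return (start_arr + expansion - contraction - churn) / start_arr

# Hypothetical cohort: $100M a year ago, $30M of expansion, $5M of
# downgrades, $8M churned outright -> 117%, i.e. the existing base
# grows on its own even with zero new logos.
print(f"{net_dollar_retention(100e6, 30e6, 5e6, 8e6):.0%}")  # 117%
```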



Corey: This episode is sponsored in part by ChaosSearch. Now their name isn’t in all caps, so they’re definitely worth talking to. What is ChaosSearch? A scalable log analysis service that lets you add new workloads in minutes, not days or weeks. Click. Boom. Done. ChaosSearch is for you if you’re trying to get a handle on processing multiple terabytes, or more, of log and event data per day, at a disruptive price. One more thing, for those of you that have been down this path of disappointment before, ChaosSearch is a fully managed solution that isn’t playing marketing games when they say “fully managed.” The data lives within your S3 buckets, and that’s really all you have to care about. No managing of servers, but also no data movement. Check them out at chaossearch.io and tell them Corey sent you. Watch for the wince when you say my name. That’s chaossearch.io.



Corey: And it seems like everything's definitely up and to the right. And also, let's not kid ourselves, data warehouses don't get smaller over time. It's very sticky revenue; it doesn't go away on its own, and when we work on AWS bills, one of the questions we ask that we already know the answer to when we ask it is, “Do you need all of this old data?” And the data science team rages, “Of course we need these very detailed transaction logs from 2012. The future of our business is in there as soon as we figure out how to unlock it.” I have my own skepticism, but okay, I get it. The industry clearly disagrees with my perspective on these things.



Pete: Yeah, no one likes to delete data. But it's funny: data is a liability as much as it is an asset. In those emails that you send out talking about the Amazon S3 breaches—what is it—the S3 Bucket Negligence Award, there are a lot of liabilities that come with having data. And for a company like Snowflake, they have all this data in S3—S3 and, obviously, their metadata layer. There's a risk there that people have to think about in the complexity of where your data lives and who owns it. But you're right: I think once you have it, you're going to keep using it, and you're going to use more of it.



Corey: One of the things that I always like to look at in S-1s, personally, is conflicts of interest. And on some level, again, everyone has some, and I get that—we promote family values here almost as often as we promote family members—but the air travel stuff that was buried in their S-1 is fascinating to me.



Pete: Yeah. This one was, uh—as I rolled through it—again, they have to disclose this stuff. You don't want to not disclose it; there are going to be real issues if you don't. But the newest CEO, who started when Bob Muglia left, is named Frank Slootman, and he has an aircraft. Like, he owns a plane. Again, I don't know what kind of plane this is—is it a jet? I don't know anything about private aviation—but it is an aircraft that's in a pool of aircraft run by this CTP aviation company, and they're a charter aircraft company.



So, in the S-1, it talks about how the business actually books charter air travel services for the CEO and other employees through that same company, and they even say, “Hey, from time to time, we actually use the CEO's plane for these business trips as well.” And because of this arrangement, they essentially have to pay a certain amount of money for these flights. And they even listed out how much—let's see here… for the fiscal year ending 2020, it was nearly $300,000 paid for the use of this private plane. Crazy.



Corey: And it's funny to me—and I'm not calling this out as any form of grift or whatnot—they paid this guy $60 million in total comp last year alone. Which, again, given what he's done for the company, seems to be reasonable, I guess; I don't know. Most of that's equity-based. So I'm not looking at that and thinking, “Oh, clearly there's malfeasance at work.” But it does raise questions, to the point where if your CEO owns a private plane and then charters it back to the company, maybe you don't use that particular plane for business travel. Maybe you go with something else, just to keep the ethical issues clear. Am I just naive on that?



Pete: I would agree. I mean, again, it's not a money thing; it's a time thing. I mean, why would you fly private? Why would you charter? Because if you're the CEO of a company that's probably going to be valued at tens of billions of dollars, your time is super valuable.



You should get to where you're going as quickly as possible. And so, you're right, though: does it look questionable when stuff like this happens? I mean, it's buried in there. I'm sure someone might say something—like we're doing—but does it really change anything? It's nothing like what was seen in some of the other S-1s, with the kind of strange monetary deals in the WeWork filing, for example.



Corey: Yeah. Among the weird disclaimers on a lot of their compensation, they wound up having a specific line item that their president slash head of product received specific compensation for certain travel expenses of his spouse in connection with a company-sponsored event. And I look at that: oh okay, how much money are we talking? The total other-compensation line that's referencing is 887 bucks last year, and this guy makes $2.2 million in terms of total comp as the president of products. At some level, just eat the expense yourself personally to avoid the appearance of conflict of interest, because we're talking about this now, and we're probably not the only people who are going to highlight it. And it's just: is the grief worth it?



Pete: Yeah, exactly. It seems silly in a lot of ways, but I'm sure at the time it was just like, “Oh, yeah, I put my expense report in and got it taken care of,” or whatever. I'm sure it was just a line item, but of course, you've got a lot of lawyers going through a lot of details to create these S-1s to make sure they disclose everything. I mean, to your point as well—always trying to find interesting things—the son of the president of products, who is also a member of the board of directors, was hired as an account executive. That's, like, an entry-level salesperson, a cold-calling person. And I feel really bad for the son because they had to list the per-year compensation that this person received while employed there. Which, sure, he's a member of a family that is going to be entering into a large amount of generational wealth.



Corey: Yeah, in 2019, he made just under $65,000 in total cash comp and commissions. Last year, because the company has taken off, he made the princely sum of $126,000. Which, again, I'm not saying that that's nothing money. Far from it, I get it. But by the same token, that's not embezzlement money.



That is the kind of money you pay a sales executive who's hitting their plan, presumably. In fact, many SaaS salespeople make significantly more than that. And it's clear from those numbers, at least, that this guy isn't having a whole lot of favors done for him, but I guarantee you you're right: everyone is going to assume, “Oh, it's because your dad is the president of products that you have this job. You're probably making five times more than we are.” It's got to be rough.



Pete: Yeah, I think between that and just—I've never had my salary published publicly so that two idiots on a podcast can talk about it before. That in itself—



Corey: Yeah, right?



Pete: That can't feel very good.



Corey: Oh, yeah. If I wind up disclosing how much I make in any given year, there's no good answer here. If it's low, it’s, “Wow, he's not making any money at all.” And if it's high, it’s, “Wow, there's good money in that making-fun-of-Amazon business.” There's no upside; there's a lot of downside, and I do feel for it, especially since this guy's very clearly relatively new to his career, and this is kind of going to, I guess, cast a shadow to some extent. I feel for the guy, I really do.



Pete: Yeah, yeah, exactly, exactly. But the last thing that I wanted to call out in this one: because we get these startups with this insane growth, you really want to hope that the quote-unquote “normal people” who have been there, just grinding it out day in and day out over a long period of time, get their win, right? They worked hard, and hopefully they get their shot at a slice of this huge success. And so what's cool is they actually listed out all of the various tender offers and secondary offerings, with price per share over a period of time. Now granted, they only list the directors who sold shares, so they're not going to tell you if Susie from accounting sold some of her shares or whatever, but it does list the share value for those people who sold.



And honestly, I think the interesting thing is that you're seeing a company whose share price was continually growing at each valuation round, which makes it a lot clearer why they raised so much money: the money was there and the valuation was good. But the other reason I like to bring this up is because this is something that not a lot of people who go to a startup realize: there are exit ramps for the founders and the people higher up—hopefully for the whole business, but oftentimes it's just for those early founders, those early people. For example, even though the former CEO, Bob Muglia, is no longer there, in the February and March timeframe this year, he sold about $80 million worth of that stock. Is it better than what it's going to IPO for? Who knows, but what I do know is that he got a check for $80 million earlier this year that he can do whatever he wants with. Are other people internally going to have that kind of win? Probably not. But it's one of those extra things—



Corey: He also left, and was either forced out or resigned; it's unclear. Part of the story, on some level, is that it's not often considered appropriate or good optics for a former executive who's no longer there to hold an outsized proportion of the company's value. So, it may very well have been a negotiated drawdown of his stake.



Pete: Yeah, I think you're exactly right there: he doesn't have a management interest in the company anymore, so why would you have so much of your net worth, I guess, tied up in there?



Corey: We saw that in some respects with the Bezos divorce, where it was very clear that Jeff Bezos’s ex-wife was going to get a fair chunk of Amazon stock, but she didn't have a role there, so how she'd function as effectively a massive minority shareholder with voting rights was an open question. Is this going to change the way that Amazon is run? In practice, no. And there's a whole story around that, but I don't think that there's anything necessarily egregious going on as far as the previous exec selling out. But I bet it does rankle a lot of the employees: “Wait, the guy that just left gets to cash out $80 million, and we're still being told that the future is bright, but we have no actual dollar figures coming in?” I don't know.



Pete: Yeah, exactly. You know… pays to be a CEO of a billion-dollar company. [laughs].



Corey: It sure does. Thank you so much once again for joining me on this one, Pete. In our next episode, we'll probably tear apart Palantir's S-1.



If you've enjoyed this podcast, please leave a five-star review on Apple Podcasts, whereas if you hated it, please leave a five-star review anyway on Apple Podcasts, and thank you for listening.



Announcer: This has been a HumblePod production. Stay humble.


Fri, 04 Sep 2020 03:00:00 -0700
8 AWS Terms Project Managers Need to Know (AMB Extras)

Links Mentioned



Sponsors



Never miss an episode



Help the show



What's Corey up to?

Wed, 02 Sep 2020 03:00:00 -0700
Amazon EC2 Hibernation Bear is High Koala-ity
AWS Morning Brief for the week of August 31st, 2020.
Mon, 31 Aug 2020 03:00:00 -0700
The Logic of Sumo Logic’s IPO (Whiteboard Confessional)

About Corey Quinn

Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Links

Transcript


Corey: This episode is brought to you by Trend Micro Cloud One™. A security services platform for organizations building in the Cloud. I know you're thinking that that's a mouthful because it is, but what's easier to say? “I'm glad we have Trend Micro Cloud One™, a security services platform for organizations building in the Cloud,” or, “Hey, bad news. It's going to be a few more weeks. I kind of forgot about that security thing.” I thought so. Trend Micro Cloud One™ is an automated, flexible all-in-one solution that protects your workflows and containers with cloud-native security. Identify and resolve security issues earlier in the pipeline, and access your cloud environments sooner, with full visibility, so you can get back to what you do best, which is generally building great applications. Discover Trend Micro Cloud One™ a security services platform for organizations building in the Cloud. Whew. At trendmicro.com/screaming.


Corey: Welcome to the AWS Morning Brief and what is normally the Whiteboard Confessional slot—but I had such a good time speaking last week with my colleague Pete Cheslock that we're back again today. Say hello, Pete.


Pete: Hello.


Corey: So, as of the day we are recording this, earlier in the week the Sumo Logic S-1 was released, which means that Sumo Logic—motto, “We do logs, too.”—is also going public. There seems to be a bit of a flurry lately of companies deciding to, well, to be uncharitable, inflict themselves on the public markets.


Pete: Yeah, it turns out when you take venture capital money, eventually those venture capitalists would like to see a return. So, it kind of makes sense in a way, but at the same time, it's just, I guess, another place to raise money.


Corey: One of the problems that I've run into across the monitoring space as these companies go public is—let's ignore the fact that none of them seem to be making money on a profitable basis. I mean, I haven't looked at the details yet, but Sumo is losing money, correct?


Pete: Oh, yeah. Yeah, absolutely. Although let's be really honest, that's not really a dig at Sumo. I mean, they all lose money. [laughs].


Corey: And to be fair, they also raised only—quote-unquote, “only”—$340 million while they were private. But there's a strange inflection here around how monitoring companies seem to work in this space. I don't know who sponsors any given episode of this show until after I've already recorded it, so I'm really hoping it's not them, but if it is, our goal is to be authentic. And it seems to me that there's very little differentiation in all of these companies that offer log analysis, for the most part. I mean, ChaosSearch, where you used to work, had something actually innovative in this space, where the data lives in S3 and you can query it without having to pay the same extortionate rates that everyone else charges. But by and large, for most of the rest of the players in this space, it seems the differentiator is starting to be marketing. Am I missing something stupendous?


Pete: No, I think you're spot on there, and you can normally see it when you look at a company's S-1. That S-1 includes a lot of information, but some of the key points—at least the ones I look at—are their financial statements; I'm just curious what their revenue is, what it costs to bring in that revenue, profit, and everything else. But these companies break out their operating expenditures across things like research and development and sales and marketing, and for a lot of these monitoring companies, you'll find their spend on sales and marketing to be just huge. In many ways, their spend is nearly their revenue. And let's not forget you still have engineers and your Amazon bill to pay for as well. So, they seem to be very marketing-centric because it's a knife fight out there in the monitoring and logging space. It seems like every day there's a new logging and monitoring company popping up with just a different way of doing things.


Corey: I get that it's a hard space and these problems are incredibly challenging. The challenge that I run into though is, in many cases, I just want a centralized place where I can effectively look at the logs in real-time as events happen, and start looking for specific patterns with various filters, and that's about it. And it seems like that is a somewhat naive use case—which I get—but then every company out there is chasing Splunk in one form or another. Because Splunk was the first company that really did this right, and they charged the appropriately high ransom in order to make that happen, and then everyone else seemed to go through a generation of, “We’re like Splunk, only not horribly expensive.” And then it became increasingly complex and down this entire path to a point where now, I'm looking at any of these tools and it turns out I need to take a class before I'm able to use them effectively, to learn their own variants of SQL, or how to wind up pointing it at some esoteric data source I'd forgotten.


Pete: Yeah, I think—and I've actually had a bunch of conversations about this, as you would expect from someone who spent time at a logging data analytics company—there have been, like, multiple waves of logging. And Splunk was kind of the first in many ways. They created a revolutionary way of storing data. That was what they built; that was the core technology, way earlier than a lot of other people were dealing with this problem.


They also focused a lot on the SIM/SIEM space—that's security information and event management. So, in a lot of ways they sold to security teams. And then you had companies popping up that were in more of the monitoring space, like the Datadogs and the New Relics of the world. Datadog and New Relic were getting the requests: “Well, we want logging, too. Like, we're paying for this.” And so then they started consolidating on logging.


And then the next generation was like, well, it costs too much money to use these hosted vendors, and the reason it costs so much is because they're using these open-source technologies to store this log data, so there's no real innovation there. And this next wave of logging companies out there are all like the, “Well, what if you didn't index your data? What if you just tagged it really, really well?” And that's the third wave we're into now, where people are just like, “I can't keep spending the GDP of a small European country on my logging every month.”
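
A toy sketch of that third-wave, “tag it, don't index it” idea: push the tags into the storage layout so a query only touches matching data. The bucket and key scheme here are hypothetical, not any particular vendor's design:

```python
# Instead of maintaining a search index, encode tags (service, date)
# into S3 object keys, then answer queries with cheap prefix scans.
import boto3

s3 = boto3.client("s3")
BUCKET = "example-log-archive"  # hypothetical bucket name

def keys_for(service, date):
    """Yield log object keys for one tag combination via a prefix scan."""
    prefix = f"service={service}/date={date}/"
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"]

# Only objects tagged service=api for that day are ever listed or read;
# the rest of the bucket never gets touched, which is where the savings
# over index-everything systems come from.
for key in keys_for("api", "2020-09-01"):
    print(key)
```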


Corey: It also appears that they are the leader in ‘Continuous Intelligence,’ which is a term that they, of course, invented. It turns out it's super easy to be the leader in an area that you invent. They claim there are five pillars to it. And one of them is multi-cloud adoption, which, if you've listened to anything I've said on the topic, you'll understand I disagree with as an entire premise. The challenge, though, is that in this S-1, reading about multi-cloud adoption as an imperative for modern businesses—yeah, the sentences that they put after that don't actually talk about multi-cloud at all. They talk about, “As you continue to sprawl, you need to be able to gather logs from everywhere.” Well, duh. What's your point here? In fact, they talk about multi-cloud being this incredibly important thing that everyone is going to embrace, but I scroll down to their risk section, and they say, and I quote, “We outsource substantially all of the infrastructure relating to our cloud-native platform to AWS.” So, which is it?


Pete: [laughs]. Yeah, it's an extremely good point. And what's also interesting is how many of these companies—you know, Sumo Logic, remember, was just a logging company to start with. New Relic was just APM to start with. Datadog was just host-based monitoring to start with.


Watching them all move into each other's territories, but kind of solving poorly what the other one is good at, in a way where you just start using more of it—I'm old enough to remember when everyone was like, “Datadog is the best monitoring system that's ever been created, ever,” and I am actually shocked to see more and more people online completely bashing Datadog, completely bashing Sumo—I mean, these bigger companies. It's like, they try to do more, and in the end, obviously, their focus gets shifted, and the thing that made them great—well, you have these upstarts that will come and take it over again because they've lost their focus.


Corey: It really tends to surprise me just seeing how everyone is chasing Datadog. That's one of my biggest problems with the monitoring space, if I'm being direct. You have these companies that are very good at a particular niche or a particular area of this where—oh, great, Sumo Logic. I've been a Sumo Logic customer and a happy one. Please don't think this is me dunking on a company that I have no affection for, or dunking on them at all. But at some point, there's this entire shift.


I don't know if it's driven by investors. I don't know if it's customer demand, but they always succumb to the Datadog problem of, “All right, now that we've handled logs or application performance monitoring or insert whatever thing that they've become known for, now we're going to effectively do all of the monitoring pieces and become the next Datadog.” Why? Is that something that people actually want or need? I don't think it is.


Pete: Right. I think it's just following the market. Using kind of your ecosystem—so if you're Datadog, and you've got this agent, you're doing all this great host-based monitoring, conceptually, adding logging, if you already have the agent in place, should be a really easy way to convert those customers and increase the average deal size. I remember looking at the Datadog S-1 when it came out—because I'm a total nerd when it comes to reading these things, understanding how businesses work—and what was shocking was actually how little net loss they had.


Like, of course they lost money because they were plowing every dollar into growth, and it showed; they were doubling their revenue year over year, which, when you're doing a $350 million run rate, is pretty big if you're continually doubling every year. And their loss was something like $1 million of net loss. It was, like, nothing. Comparatively, the Sumo Logic net loss—from 2019 to 2020, they went from $50 million in losses to nearly $100 million in losses—so you kind of look at it and you go, “Wow, that's a pretty big difference from the Datadog one.” And my guess is that Sumo Logic is going to be compared to Datadog when they go to the public markets, so I think it'll be interesting to watch.


This episode is sponsored in part by our good friends over at ChaosSearch, which is a fully managed log analytics platform that leverages your S3 buckets as a data store with no further data movement required. Whether you're looking to process multiple terabytes up to petabyte scale of data a day, or just a few hundred gigabytes, this is still economical and worth looking into. You don't have to manage Elasticsearch yourself. If your ELK stack is falling over, take a look at using ChaosSearch for log analytics. Now, if you do a direct cost comparison, you're going to save 70 to 80 percent on infrastructure costs—and that does not include the actual expense of paying infrastructure people to mess around with running Elasticsearch themselves. You can take it from me or you can take it from many of their happy customers, but visit chaossearch.io today to learn more.


Corey: For what it's worth, I find that I want to have the logs live in a particular place. I want to have, in an ideal world, a single dashboard that shows me the status of everything I care about. Unfortunately, I live in reality, and I know that it never works out that way. The single pane of glass is invariably a single pain, and I don't think there's any way you get over that. But I really wish that when I go to a particular vendor for the one thing that they're good at and great at solving, they didn't try to shove an entire boatload of other things down my throat at which they are clearly second-tier at best.


Pete: Yeah, one thing on this S-1 that I thought was curious—and I know you read a lot of S-1s as well, especially for these tech companies that have popped up more recently—


Corey: I have super bad habits and no life.


Pete: [laughs]. But I feel like I've read some S-1s that have come out that have been much more explicit about the filing company's commitment to Amazon—the Enterprise Discount Program, the EDP, where you have to make a $50 or $100 million commitment over a certain number of years. There are companies that go public and get blasted about how so-and-so has a $400 million commitment to Amazon, as if that's huge when their revenues and their usage are so high. But in this S-1, I barely saw any reference to Amazon other than a couple of call-outs and just a generic, “Hey, we have this hosting commitment of $40 million.”


Corey: Yeah, 36.9 for 2021 and 27.29 for 2022. They've already said most of their—effectively all their hosting is on AWS, so yeah, maybe they're completely overshooting their commitments. Maybe they had a much larger commitment that they’ve then powered through. And again, that's fine. I have no problem with any of the numbers that I'm seeing here. It's nice that they don't have commitments that are more than their revenue, which we've seen a few times now. I don't see any problem here as far as what I can deduce from the tea leaves that are their commitments. I really don't find too much that’s objectionable in this space, which is a welcome change.


Pete: [laughs]. That is very true.


Corey: One thing I've also noticed is that if we look at the year 2020—sorry, their fiscal 2020, which apparently ended January 31. Great. Good for them—they paid their CEO $3.7 million all told. Okay, great. Their CFO made a bit under half a million for the year. Okay… I don't know what the backstory is there. And their chief revenue officer made a bit over three-quarters of a million. None of that is egregious, as opposed to, you know, Rackspace's S-1 somewhat recently, where it was disclosed that they were losing $100 million a year—net loss last year—and a third of that was compensation to their CEO.


Pete: Yeah, I mean, you look at those types of numbers for a company where just such a huge amount is some sort of strange stock-based compensation—obviously, that was kind of the beginning of the end of WeWork, when people started seeing all these strange compensation arrangements. It's a pretty big red flag out there, that's for sure. I think the other place that I always like to look at, too, is the section that talks about the principal stockholders. Mostly out of, kind of, morbid curiosity because, for a company like this—so, Sumo Logic has been around for quite a while. According to Crunchbase—I'm on it right now—they've taken funding all the way up through Series G, and they've raised $300-and-something million. They've been around—I'm trying to find what their first year was, but, you know, long enough to raise a Series G round—so you have to imagine, for a founder of this company, if your original equity stake with a partner is 50/50, how much do you actually have at the end?


And interestingly enough, the actual percentage held by any individual person within the company: the CEO is at about 4.7%. Some of the other people listed here—a co-founder, say—are at, like, 3.4%. Admittedly, this is actually pretty low, and I do remember from the Datadog S-1 that the founders there—and that was kind of considered a runaway success—had at least 10% or more, so it kind of speaks to some headwinds they might have run into during some of those fundraising rounds. But—


Corey: Let's also point out that early in those days, before they'd raised a bunch of those rounds, they were giving probably what, if they're like other companies, somewhere between 0.1 and 0.25% of equity to employees as a sweetener to, “Hey, come work here.” If the CEO and founder was diluted down to 4.7% when they presumably started with, what, 50%, then it kind of makes you wonder what the employee story on this is. Is this one of those outcomes where the founder gets to buy a boat, and the employees get to buy a used Toyota?
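
For what it's worth, the dilution arithmetic here is easy to gut-check. A minimal sketch, assuming a made-up (but not unheard-of) 25 percent dilution per round including option pool refreshes—illustrative numbers, not Sumo Logic's actual terms:

    # Hypothetical per-round dilution; real rounds vary widely.
    stake = 0.50  # founder starts with half the company
    for rnd in ["Seed", "A", "B", "C", "D", "E", "F", "G"]:
        stake *= 1 - 0.25  # each round issues new shares, shrinking everyone
        print(f"after Series {rnd}: {stake:.1%}")
    # Eight rounds at 25% each lands right around 5%--in the
    # neighborhood of the 4.7% disclosed in the S-1.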


Pete: Right. And honestly, in many of the cases, that's usually how it ends up. I think the other thing, too, that I hadn't gotten through—because usually it's buried away in some of the small wording—is that oftentimes it'll list out, kind of, secondary market sales of shares. So there's no telling; at one of these big rounds, I'm sure that there was an opportunity. Yeah.


Corey: Oh, you can't tell what happened to employees from this, but it's one of those areas where, whenever I see people at early-stage startups talking about how their equity is going to make them a millionaire, it's, “I don't know about that.” There's a lot that can happen between the founding of the company and going public that looks a lot like dilution. You're required, on some level, to assume that your founders are going to be at least as passable a negotiator as the people across the table who do this every week; that they're going to, in some cases, put their personal interests behind those of their staff. It's an interesting problem. You have an MBA—sorry to out you on that one—you've seen a lot of stories like this, probably more than I have, and can do an actual analysis rather than me just flapping my gums in the breeze, but I'm curious to know what your thoughts are on this.


Pete: Like I said, I go to this section because I'm a person who's done just a million different startups and has always had that same thought in my head of, “Oh, this is where I'm going to go make all this money or whatever else.” The reality is, by looking at these, you have to see just how long it takes to get this kind of percentage of your startup, and like you just said, Corey, you have to survive through so many different things just to get there, just to get to the point where you IPO. Recent companies like Elastic and Datadog—and now Sumo Logic—have been around for, what, eight to ten years? You have to survive a lot to last eight or ten years as a founder of a place, to not be completely ousted by your board—you know, who knows what could happen? There are just so many scenarios that could go down. But meanwhile, it's a company that is growing. Their revenue is growing, their top line is growing, and they believe that the public markets are the least expensive way to raise additional money, so kudos to them, because it's a huge accomplishment. And I always hope the employees get to share in that and get a little bit of a payday and hopefully do something nice with that money.


Corey: Yeah, hopefully so. Thank you for taking the time once again to rant about industry news with me, Pete. It's appreciated.


Pete: Always fun to flex very lame MBA muscle.


Corey: Excellent. This has been the AWS Morning Brief. If you've enjoyed this podcast, please leave a five-star review on Apple Podcasts, whereas if you've hated it, please leave a five-star review on Apple Podcasts anyway.


Announcer: This has been a HumblePod production. Stay humble.


Fri, 28 Aug 2020 03:00:00 -0700
Route 53 Query Logging (AMB Extras)

Links Mentioned



Sponsors




Never miss an episode



Help the show



What's Corey up to?

Thu, 27 Aug 2020 09:25:20 -0700
Comfortably Spit a Rat
AWS Morning Brief for the week of August 24, 2020.
Mon, 24 Aug 2020 03:00:00 -0700
Whiteboard Confessional: Google’s Deprecation Policy

About Corey Quinn

Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Links


Transcript

Corey: Normally, I like to snark about the various sponsors that sponsor these episodes, but I'm faced with a bit of a challenge because this episode is sponsored in part by A Cloud Guru. They're the company that's sort of famous for teaching the world to cloud, and it's very, very hard to come up with anything meaningfully insulting about them. So, I'm not really going to try. They've recently improved their platform significantly, and it brings both the benefits of A Cloud Guru that we all know and love as well as the recently acquired Linux Academy together. That means that there's now an effective, hands-on, and comprehensive skills development platform for AWS, Azure, Google Cloud, and beyond. Yes, ‘and beyond’ is doing a lot of heavy lifting right there in that sentence. They have a bunch of new courses and labs that are available. For my purposes, they have a terrific learn by doing experience that you absolutely want to take a look at and they also have business offerings as well under ACG for Business. Check them out. Visit acloudguru.com to learn more. Tell them Corey sent you and wait for them to instinctively flinch. That's acloudguru.com.



Corey: Welcome to the AWS Morning Brief. In lieu of the Whiteboard Confessional’s traditional approach today, I want to talk about a different kind of whiteboard issue. Specifically, the whiteboard interview you wind up taking at Google, which is just a giant red herring because the real question is, “How well did you erase the whiteboard afterwards?” so it aligns with their policy of turning off stuff that people love. I'm joined this week by my colleague, Pete Cheslock from The Duckbill Group. Welcome, Pete.



Pete: Hello.



Corey: So, we're talking today about Steve Yegge’s article that went around the internet three times over the weekend, and it was titled “Dear Google Cloud, Your Deprecation Policy is Killing You.” Normally, you would think that that would be some form of clickbait headline, but it's not. It was a massive 23-minute long read, as per Medium. And we will, of course, throw a link to this in the [show notes]. But, Pete, what was your take on this thing?



Pete: Well, I missed it on the first go-around, but when you sent me over the link, the first thing I saw was Medium saying it's a 23-minute read, and you had told me how this post had blown up. I think that really speaks to how incredibly well written this post is about this particular issue, that people in this world are willing to invest 23 minutes to read it. I was locked into it the whole time. It held my attention the whole time because of just how deep it went into Google and just how they operate.



Corey: Steve Yegge is famous for doing the platforms rant back in 2012 or so. He's a former Amazon employee who I think spent something like seven years at Amazon, about an equal amount of time at Google, left to go run architecture at Grab a couple of years back, and then, due to these unprecedented times, is now independent, doing his own thing. So, that is an absolutely fascinating trail because when he writes about this stuff, he knows what he's talking about. This isn't one of those, “Eh, I'm just going to go ahead and pen something that's poorly articulated, and see what happens.” What's more amazing is I haven't seen much in the way of pushback on this. The points that he hits in this article are pretty damning, and even people from Google are chiming in with, “Yeah, that tracks.”



Pete: Yeah—and for all those listening who maybe haven't read this yet, and are maybe going to go read it after listening to this. What the real crux of this post is about, I guess, is how Google aggressively deprecates things, the kind of culture within Google that drives that to happen, and how opposite it is to a company like Amazon. I think my biggest takeaway from this was this light bulb—“Oh my goodness, it all makes sense now”—idea that how aggressively Google deprecates things has to do with code cleanliness. They don't like five different APIs that do the same thing, so they'll deprecate four of them and keep things clean, and whatever. And what I think is really interesting, too, that I read in here is how, internally, this works great for Google because they have all these tools that can automatically update code, and update APIs, and let people know if a deprecation is happening. But he compares this to, like, Java and Emacs, which historically take decades—if ever—to fully remove APIs. It was a really fascinating read.



Corey: It really was. And one thing that stuck with me is that it makes perfect sense in hindsight. If you are Google and can dictate how all of your employees write software that makes it into production, and have automated tooling to go back and handle deprecations for you, then great, that does work. The problem is that the rest of the world is not like your internal engineers. The problem I see behind Google Cloud, by and large, is that it assumes that everyone tends to write software the way that Google engineers would. That's not a valid assumption, I assure you. I write software that is nothing like anything anyone who calls themself a software engineer would ever write, but your cloud offering has to support my nonsense.



Pete: Right. Compare this to Amazon. So, that's one of the biggest other takeaways that really hit me. And this came up when I was looking at a cloud bill and seeing SimpleDB still on it. SimpleDB, which they don't really market—they don't tell you to use it; it's not part of their design guidance. Although, Corey, you can correct me if I'm wrong there, if it's really talked about much anymore. I don't think it is—



Corey: No, they've tried to bury it, but they are still hiring for that team from time to time.



Pete: That's—yeah, I remember you had mentioned that. And so think about that. Think about the m1.medium, right. I think m1.medium was the first instance type back in 2006—



Corey: It was either m1.medium or m1.large. I don't recall.



Pete: It was that m1 class, both of which you could run today. I've definitely seen recent launches of that class of instance, and, to your point, SimpleDB they're hiring for, and you can provision it in places. Amazon just does not turn anything off in their cloud, and that's huge. And to Steve's point in his post, he doesn't have time to go through and migrate all his code to the next new great thing. Could you imagine if Amazon deprecated something related to Lambda or DynamoDB? I mean, it would be catastrophic for organizations to have to make those changes to their codebase.



This episode is sponsored in part by our good friends over at ChaosSearch, which is a fully managed log analytics platform that leverages your S3 buckets as a data store with no further data movement required. Whether you're looking to process multiple terabytes up to petabyte scale of data a day, or just a few hundred gigabytes, this is still economical and worth looking into. You don't have to manage Elasticsearch yourself. If your ELK stack is falling over, take a look at using ChaosSearch for log analytics. Now, if you do a direct cost comparison, you're going to save 70 to 80 percent on infrastructure costs—and that does not include the actual expense of paying infrastructure people to mess around with running Elasticsearch themselves. You can take it from me or you can take it from many of their happy customers, but visit chaossearch.io today to learn more.



Corey: Exactly. He starts off with an anecdote from when he was a Google employee, where he'd spun up some Bigtable instance and got an email from an internal team saying he was running a very, very old binary in this data center—and it was years after he'd spun the thing up. And it said, “We'd like to help you upgrade to the latest version, or migrate your workloads. Let us know what we can do.” That's the kind of outreach you get when Amazon eventually decides to deprecate an AWS service.



So, what I like about this entire rant, as it were, is that he's speaking from the perspective of a current GCP customer, and this stuff is driving him up a wall. It doesn't make the ‘Killed by Google’ lists, but “the old version of a service API is being deprecated in favor of a new one, so you have to go and patch your code” is apparently a very common GCP problem. And yeah, I look at some of the stuff that I spun up in my account back when I was first learning Lambda in 2016, and sure, it's running in deprecated runtimes, but it's also still running. If I want to update anything, then I've got to change the runtime, and in almost every case, I'm not using advanced language features, so it's fine.
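
Mechanically, at least, bumping one of those 2016-era functions off a deprecated runtime is a single API call. A minimal boto3 sketch—the function name and target runtime here are placeholders, and whether your code survives the jump is a separate question:

    import boto3

    # Hypothetical function name; pick whichever supported runtime
    # your code actually works on.
    aws_lambda = boto3.client("lambda")
    resp = aws_lambda.update_function_configuration(
        FunctionName="my-2016-era-function",
        Runtime="python3.12",
    )
    print("now running on", resp["Runtime"])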



But these things are still sitting there, just working. I have heard stories from friends with AWS accounts that have been there since, basically, Bezos was in diapers. And when they're using things that wind up getting deprecated, it's an increasing series of very polite, very solicitous emails that, if they continue to fail to respond, results in a personal outreach from an engineer: “Hi, we really need to turn that thing off. Can you please work with us to help migrate onto something else? We'll help you do it.”



Pete: Yeah, I think the other point that he really touched on is that it really comes down to backwards compatibility. He talked about Java and its historical support for old APIs, deprecated-API notices when you launch an application, and even—I think he talked about the video game he has been working on for a couple of decades now—it still runs. He has not had to modify things; he's using specific deprecated APIs. But he called out specifically the great, kind of, Python 2 to Python 3 mess, and really talked about the fact that that was such a hard shift: if you have a bunch of code in Python 2, and there's no backwards compatibility for whatever you're using in Python 3, are you going to spend the time to rewrite that code into Python 3, or would you adopt a new programming language?



And he even specifically says this. He says—I'm just quoting from his blog right now—“How much Python software was rewritten in Go (or Ruby, or some other alternative) because of that backward incompatibility?” And so, he even went on to just be like, “Listen, what if Apple did the same thing, just breaking all this backwards compatibility? You do this and maybe you lose 10 or 20 percent of your user base. You do that a few times, and your user base is going to just erode before you.”
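
To make the kind of break he's describing concrete, a minimal sketch: each of the commented lines was valid Python 2 and fails outright under Python 3, which is why so much code had to be rewritten rather than merely re-run.

    # Valid Python 2, dead on arrival in Python 3:
    #     print "migrating records"          # print became a function
    #     isinstance(name, basestring)       # basestring is gone
    #     "payload: " + sock.recv(4)         # implicit str/bytes mixing: TypeError
    # The Python 3 spellings every such line had to become:
    name = "alice"
    print("migrating records")
    if isinstance(name, str):
        print(f"{name}: ok")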



Corey: Yeah, two or three times of doing that, and you've lost half your users, which is not a good metric that any platform wants to see.



Pete: It's just incredible, yeah. And I think he had a great line, which was, like, his upcoming blog post, “How to Aim for Number Two, and Miss,” or something like that. But it's true because if you're an enterprise, and you're using Google Cloud because, let's say, you sell something competitive to Amazon—or I think the common trope is you want to sell something to Walmart and your offering runs on Amazon—if you're a big enterprise, and you're committing these resources… enterprises don't upgrade things. They just normally build new things. Old things stay around forever. It's the great adage of, “Broken gets fixed, but shitty lives forever.” How are these enterprises going to respond when the rug gets pulled out from underneath them at all these intervals? My guess is that places like Azure and Oracle start to look a lot more attractive.



Corey: That's what Google's missing. For all its faults, Microsoft is very good at providing obnoxiously extended support. I've heard whispers that there are still Windows XP customers on some manufacturing floors that are getting long-term support even now because the enormously expensive piece of equipment requires something that only runs on XP to function.



Pete: Right. And if you imagine, even, like, XP Embedded, I had a car navigation system from 2004 that ran XP Embedded. That car is still on the road somewhere and may need some sort of support. And again, that's just a car. What about all the other systems that are out there?



It's definitely something that I would keep in mind—that I would think about—when I'm building an application in the cloud world: I'm not going to want to spend my very limited and very expensive engineering time constantly chasing these APIs that are getting deprecated. And most importantly, these API deprecations may not actually bring value to your product or your business. It might just be toil; it might just be busy work that your company has to do because your cloud provider keeps making you. All of that is opportunity cost wasted, and you're not focusing on the thing that hopefully brings your company money.



Corey: What just kills me about this entire piece is how there's a chorus of violent agreement around this. If you want people to use your cloud, on some level, you can't expect them to update things that are already working correctly just because, “Oh, there's going to be a new library version, and if you don't update within the next year, it's going to start breaking.” People don't generally want to do that. If you're going to update something that you've built and shipped into production, you want it to be on your timetable, not because your upstream vendor decided that it was time for you to upgrade now that they've come up with something better/different.



And a lot of these are the same services that are getting deprecated and updated. It feels like you're on a treadmill. At which point, one of the more contentious things that he said was that it is less DevOps effort to run the open-source versions of whatever it is that you're using instead, and just manage the VMs, because at least that stuff's going to keep working. Now, people have argued about that, but I disagree that it's necessarily the wrong approach. Yes, it's more work on some level, but you also get to determine where that work takes place and how it functions. You aren't beholden to someone else's timelines in quite the same way.



And as we look at a lot of cloud migrations, people are still struggling with how to migrate their freaking mainframes from the 1970s. That's a long time. How much PHP written in the early 2000s is still in active production? You've got to understand that customers do terrible things, and if you're not willing to support those things, maybe you shouldn't be a cloud provider.



Pete: Yeah, I mean, towards the end of this article, this post, he's like, “Listen, they want you to use this. They want you to use their cloud. They just don't want to support anything because support is not part of that Google DNA.” And he even says, “Listen, the engineers support each other,” because he called out that Bigtable anecdote where an engineering team reached out to help support him.



But you've got to remember, too, that they reached out because he was running something old. They don't like that. It's not clean. They need to go and upgrade this thing sitting around, and they're willing to help you do that. What's interesting, though—and I'd love to hear from any larger customers that maybe have more of an economic advantage with Google, although does any really exist for GCP—is who has been able to push back on some of these deprecations. Can they push back and use money as leverage with a company that truly doesn't need money? I mean, there's so much ad money that it's almost like, what's the point of Google Cloud revenue? It's really fascinating to see.



Corey: It really is. Pete, thank you for taking the time to rant about this with me. Me ranting in a monologue about this would be far less coherent. It's challenging to wind up and start talking smack about GCP, just because I'm perceived—rightly or wrongly—as an AWS fan. Which I am, but I want Google to get better at this. I want there to be a viable Google option. I don't want the default answer to be AWS for everything. So, we'll see what comes out of it.



Pete: Yeah, thanks a lot for letting me join. I think just reading this and being kind of blown away by it was really a great way to start the day for me, so I appreciate it.



Corey: Thank you for joining. Pete Cheslock, cloud economist here at The Duckbill Group. Along with me, Corey Quinn, chief cloud economist at same, here on the AWS Morning Brief: Discussions.



Announcer: This has been a HumblePod production. Stay humble.

Fri, 21 Aug 2020 03:00:00 -0700
Cloud Repatriation Isn’t a Thing (AMB Extras)

Links Mentioned



Sponsors



Never miss an episode



Help the show



What's Corey up to?

Wed, 19 Aug 2020 05:13:38 -0700
AWS Observerless Now GA

AWS Morning Brief for the week of August 17, 2020.

Mon, 17 Aug 2020 03:00:00 -0700
Whiteboard Confessional: The Case for Internal Tooling

About Corey Quinn

Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.


Links



Transcript

Corey: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semi-polite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best name to call your staging environment is “theory,” because invariably whatever you’ve built works in theory, but not in production. Let’s get to it.



Corey: This episode is brought to you by Trend Micro Cloud One™. A security services platform for organizations building in the Cloud. I know you're thinking that that's a mouthful because it is, but what's easier to say? “I'm glad we have Trend Micro Cloud One™, a security services platform for organizations building in the Cloud,” or, “Hey, bad news. It's going to be a few more weeks. I kind of forgot about that security thing.” I thought so. Trend Micro Cloud One™ is an automated, flexible, all-in-one solution that protects your workloads and containers with cloud-native security. Identify and resolve security issues earlier in the pipeline, and access your cloud environments sooner, with full visibility, so you can get back to what you do best, which is generally building great applications. Discover Trend Micro Cloud One™, a security services platform for organizations building in the Cloud. Whew. At trendmicro.com/screaming.



In almost any production environment, there are going to be a few tasks, as your company grows, that someone winds up having to perform in your production app. And in many cases, the people who have to perform those tasks are themselves not excessively technical, which means if you fail to properly invest in internal tooling, you're going to have someone who winds up with, effectively, a printed-out page that hangs in their cubicle—or equivalent during these uncertain times—where they follow a checklist of, step one: SSH into a production server. Step two: copy and paste the following command, which in turn, I don't know, spins up a Ruby on Rails console, or does some task on the database and returns a query. Now, this is universally recognized as awful because, for better or worse, most business users are not overwhelmingly comfortable using SSH on the command line.



Now, in an ideal world with unlimited resources, you would be able to have an internal tools developer who could focus on things like that specifically for your teams. And in fact, most very large hyper-scale companies have entire herds of people doing nothing but that. But when you're building something from scratch, and you're a relatively small, scrappy team, it's much more challenging because you take a step back and have to make some unfortunate and challenging determinations of, “Okay, am I going to A) sit here and have very expensive people build tooling, or B) have them work on features, which, you know, bring money into the company?” I'm not going to sit here and say that people are wrong for not investing in internal tooling early on.



But at some point, the longer you go without making those investments, the greater your risk is because someone is going to get something wrong. They're going to fat-finger a command somewhere; they're going to run it on the wrong system; a key pair is going to not do what it needs to do; some error-checking was not built into whatever script you're having them run, and a command is going to fail, but it's going to continue on as if it succeeded and potentially run the wrong thing in the wrong place. It effectively is setting up a recipe for disaster, and when this happens, as it inevitably will, the natural response is going to be to blame the poor schmuck who had to go ahead and run your crappy shell script command because you couldn't bother to invest in internal tooling. This is an area that's near and dear to my heart because it's something that I spend a fair bit of time worrying about myself. Again, I've built a ridiculous architecture that powers my newsletters, and I have a separate aspect of that, that lets my ad sales folks wind up injecting sponsor stuff into the newsletter for me.



Fun fact that isn't super well known: I don't see any of the sponsor stuff that goes out in my newsletter until after I've already written that week's issue because I don't want to wind up finding myself having to change what I say to avoid irritating a sponsor, you know, like someone with a sense of self-preservation or an appreciation for maintaining their income might do. So, it's sort of an editorial firewall for me. In order for that to make sense, though, there was no way in the world I was going to get away with having the people who manage the ad sales portion SSH-ing into a box and running some arcane script that talks to DynamoDB. And, “Oh, yeah, just run this script; it invokes a Lambda function, and—hey, where are you going? Come back,” is how that story is going to play out.



So, my initial approach was to look into what it would take to pay someone who's good at building web forms and front-end tooling. It turns out those people cost a lot of money. My approach was to ultimately use Retool, which I've talked about repeatedly on this show, but there are a lot of tools in this space. AWS Honeycode, for example, is one of the worst examples of something like this. The value there is that it ties together a bunch of APIs with a drag-and-drop Visual Basic style interface that lets you build internal web apps.



And their pricing model is such that you would never in a million years use this for anything public. But for internal tooling, it's a great approach. Sure, you need some developer time to set up the APIs, or the scripts that it calls on the back end, but it's a real accelerant here because you don't need anyone to spend time on UI past drag-and-drop. When it comes time to update something, you can change an API parameter or build a quick API on the other side, and the interface remains remarkably consistent for users. There are a number of tools like this out there, and I'm a big fan of the no-code/low-code movement, specifically because it solves incredible business issues here.
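
As a sketch of what that division of labor looks like—the form itself is drag-and-drop, so the only code anyone writes is the API it calls—here's a hypothetical Lambda handler of the sort such a form might POST to. The table name and field names are made up for illustration:

    import json
    import boto3

    table = boto3.resource("dynamodb").Table("sponsor-placements")  # hypothetical table

    def handler(event, context):
        body = json.loads(event["body"])
        # Validate up front so the business user sees a clear error,
        # not a stack trace.
        if not body.get("sponsor") or not body.get("issue_date"):
            return {"statusCode": 400, "body": "sponsor and issue_date are required"}
        table.put_item(Item={
            "sponsor": body["sponsor"],
            "issue_date": body["issue_date"],
            "copy": body.get("copy", ""),
        })
        return {"statusCode": 200, "body": "placement saved"}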



This episode is sponsored in part by our good friends over at ChaosSearch, which is a fully managed log analytics platform that leverages your S3 buckets as a data store with no further data movement required. Whether you're looking to process multiple terabytes up to petabyte scale of data a day, or just a few hundred gigabytes, this is still economical and worth looking into. You don't have to manage Elasticsearch yourself. If your ELK stack is falling over, take a look at using ChaosSearch for log analytics. Now, if you do a direct cost comparison, you're going to save 70 to 80 percent on infrastructure costs—and that does not include the actual expense of paying infrastructure people to mess around with running Elasticsearch themselves. You can take it from me or you can take it from many of their happy customers, but visit chaossearch.io today to learn more.



Now, please don't misunderstand me. I'm not having this conversation to shill for any particular product or service—unless they're sponsoring this episode. I'm talking instead about the higher-level pattern of making sure that you take the time to invest in tooling before it winds up blowing up in your face and you wind up inevitably blaming someone for something that isn't really their fault. Again, I run a business; I am extremely sympathetic to the fact that there's an infinite amount of work, and for most of us who are not funded by SoftBank, there's not an infinite amount of resources to throw at that infinite amount of work. So we make tradeoffs, we make decisions, and yes, there's always going to be something that is overly complicated and technical because either it doesn't happen frequently enough to be worth investing in tooling around, or it's complicated, or it's simple enough that “just run this one command and it'll be fine.”



That can get you by for a surprising period of time, but eventually, someone is going to copy and paste something wrong, and it's going to lead to disaster. So, fundamentally, what I'm suggesting and advocating for here is to invest at least a little bit of effort in getting to a point of internal tooling that doesn't require four to eight hours of training someone on how the Linux command line works—which is nutty in this year of our Lord 2020—and give them something that looks a lot more like an internal web page. Now, as we have learned from a very public recent Twitter hack, you're going to want to be careful with how you handle access to said internal tools because at some point, what you're building is fundamentally going to look increasingly like an admin panel. From my perspective, for sending out my newsletter, there is no button inside any of these systems that will cause a newsletter to send.



This sounds like an intelligent safety approach, but it's not. In fact, it's a limitation of ConvertKit’s lack of a broadcast API. After all of my highly technical stuff finishes, I have to copy and paste the HTML, like some kind of farm animal, into a web page. Now, that has a whole series of problems, but the silver lining is that if someone were to break into my newsletter production system, which is possible, all they would be able to do is muck up some of the content and delete some stuff, and there are backups of all of these things. It isn't ever going to get to a point where someone has gotten access to this stuff and now my career is ruined, or I have spammed a bunch of nonsense to my newsletter subscribers. To be very clear here: the stuff that I spam my newsletter subscribers with is highly intentional, and it is, basically, my rambling equivalent of this podcast in text form.



So, we've covered a fair bit of ground here. In summation, invest in internal tooling insofar as you can, understand there are going to be times that you're not going to be able to make those investments, and gain the wisdom to know the difference between those two scenarios before it blows up in your face, and you blame the wrong person for your own shortcomings.



This has been the AWS Morning Brief: Whiteboard Confessional, I'm Cloud Economist Corey Quinn. And if you've liked this podcast, please leave a five-star review on Apple Podcasts, whereas if you've hated this podcast, please leave a five-star review on Apple Podcasts as soon as you build a tool to do it for you automatically.



Thank you for joining us on Whiteboard Confessional. If you have terrifying ideas, please reach out to me on twitter at @quinnypig and let me know what I should talk about next time.



Announcer: This has been a HumblePod production. Stay humble.

Fri, 14 Aug 2020 03:00:00 -0700
Disaster Recovery (AMB Extras)

Links Mentioned



Sponsor



Never miss an episode



Help the show



What's Corey up to?

Wed, 12 Aug 2020 13:10:42 -0700
Don't Hate the Player; Hate the Name

AWS Morning Brief for the week of August 10, 2020.

Mon, 10 Aug 2020 03:00:00 -0700
Whiteboard Confessional: Secrets about Secrets Management

About Corey Quinn

Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Links

Transcript


Corey: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semi-polite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best name to call your staging environment is “theory,” because invariably whatever you’ve built works in theory, but not in production. Let’s get to it.



Corey: This episode is brought to you by Trend Micro Cloud One™. A security services platform for organizations building in the Cloud. I know you're thinking that that's a mouthful because it is, but what's easier to say? “I'm glad we have Trend Micro Cloud One™, a security services platform for organizations building in the Cloud,” or, “Hey, bad news. It's going to be a few more weeks. I kind of forgot about that security thing.” I thought so. Trend Micro Cloud One™ is an automated, flexible, all-in-one solution that protects your workloads and containers with cloud-native security. Identify and resolve security issues earlier in the pipeline, and access your cloud environments sooner, with full visibility, so you can get back to what you do best, which is generally building great applications. Discover Trend Micro Cloud One™, a security services platform for organizations building in the Cloud. Whew. At trendmicro.com/screaming.



Welcome. I am Cloud Economist Corey Quinn, and this is the AWS Morning Brief: Whiteboard Confessional. One of the nice things about how I do business is that I don't actually know, when I record these episodes, who is going to be sponsoring them. Today, I'm going to talk about secrets management. The reason I bring this up is that should whatever sponsor has landed the ad slot for this week be talking about a different way of handling secrets management, you should, of course, disregard everything I'm about to say and buy their product and/or service instead. That said, let's talk about secrets management and how it can be done in some of the most appalling ways imaginable.



There are a depressing number of you listening to this for whom, if I were to steal your laptop, you potentially would not have hard drive encryption turned on, so I could just pull things off of your system. That said, most modern operating systems do this by default now, so it's less of a threat. Now, let's pretend that I instead surmount an almost impossible barrier. That's right: getting a corrupted browser extension onto your system that somehow has access to poke around in your user's home directory.



Think for a second about what I might find. Would I find, oh, I don't know, SSH keys that would grant me access to your production environment? Well, that wouldn't be that big of a problem because there's no possible way I would know what hosts they go for unless I look at the known_hosts file sitting right next to your SSH keys. But even that's a little esoteric because that's not something I would ever do at grand scale. Let’s instead consider what happens if I poke around in the usual spots and find long-lived IAM credentials, or whatever your cloud provider of choice’s equivalent is, which I believe is IAM in most cases unless you're using IBM Cloud, in which case, it's probably an old-timey skeleton key that is physically tied to your laptop.
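
If you're wondering whether you're one of those people, the check takes a dozen lines. A minimal sketch: long-lived IAM user keys start with AKIA, while temporary STS credentials start with ASIA, so flagging the former in the usual spot is easy.

    import configparser
    import os

    # Scan the default AWS credentials file for long-lived user keys.
    path = os.path.expanduser("~/.aws/credentials")
    config = configparser.ConfigParser()
    config.read(path)
    for profile in config.sections():
        key = config[profile].get("aws_access_key_id", "")
        if key.startswith("AKIA"):
            print(f"profile '{profile}' holds a long-lived key: rotate it or vault it")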



Now, the reason this becomes a common pattern is that it's honestly pretty convenient. You're going to need to be able to access production environments or your cloud environment and have permissions that are generally granted to you, and security is always juxtaposed with convenience. And invariably, convenience tends to win out. Sure, you can mandate the use of multi-factor authentication for those credentials to get into production, but that means you have to type in a code or press a button on a YubiKey or something else. That fundamentally means you're going to be spending a lot more time pressing buttons or digging out passphrases than you're going to spend getting into production in a hurry.
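
For the record, the MFA dance everyone skips is one STS call: trade the long-lived key plus a token code for session credentials that expire on their own. A minimal boto3 sketch—the MFA device ARN here is a placeholder:

    import boto3

    sts = boto3.client("sts")
    creds = sts.get_session_token(
        SerialNumber="arn:aws:iam::123456789012:mfa/you",  # placeholder ARN
        TokenCode="123456",      # the six digits off your device
        DurationSeconds=3600,
    )["Credentials"]
    print("temporary key", creds["AccessKeyId"], "expires", creds["Expiration"])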



So, we make trade-offs; we cheat; it's human nature. And of course, once you get into your production environment, things are rarely better. It seems that you have a choice: you can either have the same password shared absolutely everywhere within an environment, or you can have these incredibly secure key management systems where, in return, it becomes virtually impossible to rotate credentials. We've seen this before, and we've talked about this before: what happens when someone leaves a job unexpectedly, and suddenly the credential rotation causes four site outages over the next two days.



There's always a trade-off here. And the problem is that these elaborate multi-step secret retrieval processes that people can deploy are no stronger than their weakest link. I've talked about it in an early episode, but probably one of the most bizarre I've ever seen was for regulated data, where in order to start the database server, it required a long key that was cut into pieces, and then we needed to have multiple staff contribute and turn their keys like we were launching a freakin’ nuclear missile from a submarine. And it worked, sure, but at the same time, it meant that to restart a server, you needed at least two people nearby, and that became a little nutty. Let's also ignore for a minute the fact that this was just for encrypting the data at rest.



Once the service was running, it was loaded into RAM. There was no real guarantee that this was going to be any more secure than anything else. And let's face it, we're living in an era now where people stealing the server out of our cloud-hosted environment is not the primary or secondary or tertiary threat model that anyone has to worry about. For better or worse, you can give an awful lot of crap to the cloud providers, but they've pretty much solved the ‘someone rams a truck into the side of the building, tosses a rack into the back of said truck, and peels off into the night’ problem. Except IBM Cloud. So, what are some patterns that work for this? Great question. But first:



Corey: This episode is sponsored in part by ChaosSearch. Now their name isn’t in all caps, so they’re definitely worth talking to. What is ChaosSearch? A scalable log analysis service that lets you add new workloads in minutes, not days or weeks. Click. Boom. Done. ChaosSearch is for you if you’re trying to get a handle on processing multiple terabytes, or more, of log and event data per day, at a disruptive price. One more thing, for those of you that have been down this path of disappointment before, ChaosSearch is a fully managed solution that isn’t playing marketing games when they say “fully managed.” The data lives within your S3 buckets, and that’s really all you have to care about. No managing of servers, but also no data movement. Check them out at chaossearch.io and tell them Corey sent you. Watch for the wince when you say my name. That’s chaossearch.io.



Hopefully, that ad was not about secrets management. Again, if it was, please disregard everything I'm saying and buy that product or service instead. Now, there are tools out there that will solve this problem for you. HashiCorp Vault is a good example. And over in the world of AWS, you have a couple of options. You have Systems Manager Parameter Store, which is free but has a long-winded, stupid name, or Secrets Manager, which does exactly what it says on the tin but costs 40 cents per secret per month. The question is, is it worth 40 cents per secret for you to avoid a stupid name? It certainly is for me. Snark aside, one key differentiator that I'm a fan of is that Secrets Manager lets you invoke a Lambda function during credential rotation, which means you can teach it how to talk to any arbitrary database system you've got, run some script that winds up updating the credential, and then, effectively, it is push-button, rotate-credential-globally.
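
For a sense of what that looks like, Secrets Manager invokes the rotation function once per step, and your job is to fill in the database-specific parts. A minimal sketch of the handler contract—the set-it-in-the-database and test-the-login steps are left as comments because they're entirely yours:

    import boto3

    sm = boto3.client("secretsmanager")

    def handler(event, context):
        secret_id = event["SecretId"]
        token = event["ClientRequestToken"]
        step = event["Step"]
        if step == "createSecret":
            # Stage a new password under the AWSPENDING label.
            new_pw = sm.get_random_password(ExcludeCharacters="\"'\\")["RandomPassword"]
            sm.put_secret_value(SecretId=secret_id, ClientRequestToken=token,
                                SecretString=new_pw, VersionStages=["AWSPENDING"])
        elif step == "setSecret":
            pass  # push the AWSPENDING value to your actual database here
        elif step == "testSecret":
            pass  # log in with the pending credential to prove it works
        elif step == "finishSecret":
            # Promote AWSPENDING to AWSCURRENT, demoting the old version.
            versions = sm.describe_secret(SecretId=secret_id)["VersionIdsToStages"]
            current = next(v for v, s in versions.items() if "AWSCURRENT" in s)
            sm.update_secret_version_stage(SecretId=secret_id,
                                           VersionStage="AWSCURRENT",
                                           MoveToVersionId=token,
                                           RemoveFromVersionId=current)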



This gets into the larger-scale pattern that the things that are scary or dangerous—the things that scare the heck out of people—are exactly the sort of things you should do more of. “Well, we haven't rotated our passwords or certificates in three years because the last time we did, it caused an outage,” is almost always the wrong direction to go in. The better approach for sensible human beings is, “Ooh, that was difficult and painful. How do we do that enough so that, A) it becomes routine, and B) it becomes something that we can build automation around?” So it's less fight-the-wolf-to-a-standstill and more push-the-button. This, incidentally, is one of the historically dangerous parts of SSL certificates having incredibly long expiration times. In fact, a number of browsers are now not going to honor certificates with expiry periods of longer than one year, and that's kind of a good thing. Let's Encrypt, the free certificate authority, only gives 90 days of validity, which means you're basically forced to automate this away, which is great.
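
And the automation half of that starts with knowing how long you've got. A minimal sketch of the sort of expiry check you'd wire into monitoring—the hostname is a placeholder:

    import socket
    import ssl
    import time

    def days_until_expiry(hostname: str, port: int = 443) -> float:
        # Complete a TLS handshake and read the certificate's notAfter field.
        ctx = ssl.create_default_context()
        with socket.create_connection((hostname, port), timeout=10) as sock:
            with ctx.wrap_socket(sock, server_hostname=hostname) as tls:
                not_after = tls.getpeercert()["notAfter"]
        return (ssl.cert_time_to_seconds(not_after) - time.time()) / 86400

    print(f"{days_until_expiry('example.com'):.0f} days left")  # placeholder host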



Otherwise, in the olden days, we had these five-year validity windows for certificates, so by the time a certificate expired, A) the people who'd set it up were long gone—and frankly, working with OpenSSL command lines was always a question mark in the best of cases—and B) these certificates had spread so far within the organization—by hand—that no one knew where all of them lived. And the way we found out was when these certificates expired, invariably at five in the morning on a weekend, when we could least afford the downtime or a person to look at it. Honestly, every time you try to pull up a website that has an expired certificate, you sort of shake your head and wonder who dropped what ball. It certainly doesn't give you any degree of confidence in their technical competence. Frankly, I disregard blog posts I read when I'm confronted with a certificate error. If I were confronted by an expired cert when logging into my bank, I've got to say, it's painful, but I would probably find a new bank.



So, production is one beast, but your laptops are another. One pattern that I'm a big fan of, and that kind of works with both, is the idea of forcing credential rotation on a cadence. Some tools, like aws-vault, will do this in the background automatically. What I'm a big fan of in the world of EC2 is using instance roles because those are automatically rotated credentials that have a validity window of less than a day. So, if something gets compromised, there's a very limited window of validity during which they can cause damage, as opposed to—let's face it—your laptop, where your IAM key pairs and SSH keys are probably damn near old enough to vote, for some of you.
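
For the curious, here's roughly where those instance role credentials come from—a minimal sketch against the instance metadata service (IMDSv2), which only answers from inside an EC2 instance:

    import json
    import urllib.request

    BASE = "http://169.254.169.254/latest"

    # IMDSv2 requires a short-lived session token before anything else.
    req = urllib.request.Request(f"{BASE}/api/token", method="PUT",
                                 headers={"X-aws-ec2-metadata-token-ttl-seconds": "300"})
    token = urllib.request.urlopen(req).read().decode()
    hdr = {"X-aws-ec2-metadata-token": token}

    def get(path):
        return urllib.request.urlopen(
            urllib.request.Request(f"{BASE}{path}", headers=hdr)).read().decode()

    role = get("/meta-data/iam/security-credentials/").strip()
    creds = json.loads(get(f"/meta-data/iam/security-credentials/{role}"))
    print("temporary credentials expire at", creds["Expiration"])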



So, in conclusion, take a look at what your risk exposure is with credentials. Understand that there's a spectrum of good ways and bad ways to solve this, and despite what anyone tells you about how awesome their approach is, invariably there is someone in their environment who is doing it completely wrong.



This has been the AWS Morning Brief: Whiteboard Confessional. I'm Cloud Economist Corey Quinn. And if you've enjoyed this podcast, please leave a five-star review on Apple Podcasts. Whereas if you've disliked this podcast, please instead leave a five-star review on Apple Podcasts and a copy of your latest certificate pair.



Thank you for joining us on Whiteboard Confessional. If you have terrifying ideas, please reach out to me on twitter at @quinnypig and let me know what I should talk about next time.



Announcer: This has been a HumblePod production. Stay humble.

Fri, 07 Aug 2020 03:00:00 -0700
Multi-Cloud is the Worst Practice (AMB Extras)

Links Mentioned



Sponsors



Never miss an episode



Help the show



What's Corey up to?

Wed, 05 Aug 2020 07:30:00 -0700
Drastic Load Balancing Code Changes
AWS Morning Brief for the week of August 3rd, 2020.
Mon, 03 Aug 2020 03:00:00 -0700
Whiteboard Confessional: The Bootstrapping Problem

About Corey Quinn

Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Links



Transcript

Corey: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semi-polite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best name to call your staging environment is “theory,” because invariably whatever you’ve built works in theory, but not in production. Let’s get to it.



Corey: This episode is brought to you by Trend Micro Cloud One™. A security services platform for organizations building in the Cloud. I know you're thinking that that's a mouthful because it is, but what's easier to say? “I'm glad we have Trend Micro Cloud One™, a security services platform for organizations building in the Cloud,” or, “Hey, bad news. It's going to be a few more weeks. I kind of forgot about that security thing.” I thought so. Trend Micro Cloud One™ is an automated, flexible, all-in-one solution that protects your workloads and containers with cloud-native security. Identify and resolve security issues earlier in the pipeline, and access your cloud environments sooner, with full visibility, so you can get back to what you do best, which is generally building great applications. Discover Trend Micro Cloud One™, a security services platform for organizations building in the Cloud. Whew. At trendmicro.com/screaming.



Hello, and welcome to this edition of the AWS Morning Brief: Whiteboard Confessional, where we confess the various architectural sins that we and others have committed. Today, we're going to talk about the time I took a job at a web hosting provider. It was the thing to do at the time because AWS hadn't eaten the entire world yet; therefore, everything that we talk about today was still a little ways in the future. So, it was a more reasonable approach, especially for those with, you know, budgets that didn't stretch to infinity, or without the willingness to be an early adopter of someone else's hosting nonsense, to go ahead and build out something in a data center.



Now, they were obviously not themselves hosting on top of a cloud provider because the economics made less than no sense back then. So, instead, they had multiple data centers built out that provided for customers' various hosting needs. Each one of these was relatively self-contained unless customers wound up building something themselves for failover. So, it wasn't really highly available so much as it was a bunch of different single points of failure; an outage of one would impact some subset of their customers, but not all of them. And that was a fairly reasonable approach, provided that you communicate that scenario to your customers, because that's an awful surprise to spring on them later.



Now, I was brought in as someone who had had some experience in the industry, unlike many of my colleagues, who had come from the hosting provider’s support floor and been promoted into systems engineering roles. So, I was there to be the voice of industry best practices, which is a terrifying concept when you realize that I was nowhere near as empathetic or aware back then as I am now, but you get what you pay for. And my role was to apply all of those different best practices that I had observed, and osmosed, and bluffed my way through, to what this company was doing, and see how it fit in a way that was responsible, engaging, and possibly entertaining. So, relatively early in my tenure, I was taking a tour of one of our local data centers and was asked what I thought could be improved. Now, as a sidebar, I want to point out that you can always start looking at things and pointing out how terrible they are, but let's not kid ourselves; we very much don't want to do that, because there are constraints that shape everything we do, and we aren't always aware of them. Making people feel bad for their choices is never a great approach if you want to stick around very long. So, instead, I started from the very beginning, and led with, “Hi. I'm going to ask the dumb questions and see where the answers lead me.”



So, I started off with, “Great, scenario time. The power has just gone out. Everything's dark; now how do we restart the entire environment?” And the response was, “Oh, that would never happen.” And to be clear, that's the equivalent of standing on top of a mountain during a thunderstorm, cursing God while waving a metal rake at the sky. After you say something like that, there is no disaster that is likelier. But all right, let's defuse that. “Humor me. Where's the runbook?” And the answer was, “Oh, it lives in Confluence,” which is Atlassian’s wiki offering, for those who aren't aware. Wikis in general, and Confluence in particular, are where documentation and processes go to die. “These are living documents” is a lie that everyone tells, because that's not how it actually works.



“Cool. Okay, so let's pretend that a single server, instead of your whole data center, explodes and melts. Everything's been powered off; you turn it back on, and one server doesn't survive the inrush current and explodes. That server happens to be the Confluence server. Now what? How do we bootstrap the entire environment?” The answer was, “Okay, we started printing out that runbook and keeping it inside each data center,” which was a way better option. Now, the trick was to make sure that you revisited this every so often, whenever something changed, so that you weren't looking at how things were circa five years ago, but that's a separate problem. And this is fundamentally a microcosm of what I've started to think of as the bootstrapping problem. I'll talk a little bit more about what those look like in the context of my data center atrocities. But first:



This episode is sponsored in part by our good friends over at ChaosSearch, which is a fully managed log analytics platform that leverages your S3 buckets as a data store with no further data movement required. Whether you're looking to process multiple terabytes (or even petabytes) of data a day or just a few hundred gigabytes, this is still economical and worth looking into. You don't have to manage Elasticsearch yourself. If your ELK stack is falling over, take a look at using ChaosSearch for log analytics. Now, if you do a direct cost comparison, you're going to save 70 to 80 percent on the infrastructure costs, which does not include the actual expense of paying infrastructure people to mess around with running Elasticsearch themselves. You can take it from me, or you can take it from many of their happy customers, but visit chaossearch.io today to learn more.



Now, let's talk about some of these bootstrapping atrocities. Pretend that you have a fleet of physical machines all running virtual machines, and your DNS servers, two in every environment at a minimum, live inside VMs for easy portability. Great. So, that makes sense; your primary and your secondary wind up being virtual machines, and you can migrate them anywhere. What happens if they migrate onto the same physical server? You have now taken your redundancy and collapsed it back down to a single point of failure. If that one physical server dies, there's no DNS in your data center, and everything else will rapidly stop working.



Let's take it a step further. Assume a full site-wide power outage. If your physical servers need DNS to work in order to boot all the way into a state where they can launch virtual machines successfully, and there are no DNS servers available, now what? Well, now you're in trouble. Maybe the answer is to remove that DNS dependency from getting virtual machines up and running. Maybe it's to make the DNS servers physical nodes that each live in different racks, as a sort of exception to your virtual machine approach.
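
In modern AWS terms, the equivalent guardrail is an anti-affinity rule. Here's a minimal sketch of the idea with boto3 (the AMI ID, group name, and region are hypothetical placeholders): a "spread" placement group refuses to co-locate instances on the same underlying hardware, so your two resolvers can never share a physical host.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# A "spread" placement group puts each instance on distinct underlying hardware.
ec2.create_placement_group(GroupName="dns-spread", Strategy="spread")

# Launch both DNS resolvers into the group; EC2 places them on separate
# physical hosts, so one host dying can't take out both of them.
ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # hypothetical AMI
    InstanceType="t3.micro",
    MinCount=2,
    MaxCount=2,
    Placement={"GroupName": "dns-spread"},
)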



There are a lot of options you can take here, but being aware that the problem and failure mode exist is where you have to start. Without that, you won't plan around it. Another example is a storage area network, or SAN. These things are usually highly redundant, incredibly expensive, and designed to be always available. But when they're not, due to misconfiguration, power issues, someone unplugging the wrong thing, et cetera, suddenly you've put an awful lot of eggs into one incredibly expensive basket. How do you recover from a SAN outage? Have you ever tried it? Do any of the parts that you're going to need to recover from that outage require, you know, that SAN to be available, or things that live on that SAN to be available?



How things break, and what those failure modes look like, are problems you need to be aware of. And this carries forward into the time of Cloud. If you wind up having a DR plan where you're going to fail over from us-east-1 to us-east-2 in AWS land—so, from Virginia to Ohio—great. It works super well when you do a DR exercise. But when there's an actual outage in us-east-1, you're not the only person with that plan. So, suddenly, there's control plane congestion, which leads to incredible latency. It may take 15 to 20 minutes for an instance to finish coming up where before it took 40 seconds.
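
One small mitigation you can bake in ahead of time is telling your SDK to expect exactly that throttling. This is a sketch, not a fix—assuming boto3, with a hypothetical AMI—where adaptive retry mode backs the client off automatically when the control plane starts rejecting calls during a mass failover:

import boto3
from botocore.config import Config

# Adaptive retries watch for throttling errors and slow the client down,
# which is exactly the failure mode during a region-wide stampede.
congestion_tolerant = Config(retries={"max_attempts": 10, "mode": "adaptive"})
ec2 = boto3.client("ec2", region_name="us-east-2", config=congestion_tolerant)

# Launches may still take far longer than usual; the retries just keep
# your DR automation from falling over on the first throttle response.
ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # hypothetical AMI
    InstanceType="m5.large",
    MinCount=1,
    MaxCount=1,
)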



It's the herd-of-elephants problem that you're never going to surface in a mock DR test; it only tends to manifest when everyone starts doing the same thing at once. There are a few ways around this, but one is to have those systems provisioned and up and running even though you don't need them right now. This is a trade-off between what makes sense economically and what you want your durability to look like. It's a spectrum, and you have to figure out what makes sense. A less expensive approach might be to go in the opposite direction. If you're planning for half the internet to fail over from Virginia to Ohio, maybe you prepare for the opposite: have everything running steady-state in Ohio, and then, when that goes down, spin things up in Virginia or Oregon. It still has the herd-of-elephants problem, but it's going to be less common in that direction.



One last single point of failure that I want to highlight is, for many companies, the company credit card. If your AWS payment fails to go through, what then? Most people have not configured a secondary or tertiary method of payment, and then it becomes a serious problem. If you're not big enough to have your own account manager and a relationship with that person—spoiler, by the way: every account has an account manager; most just don't know it—you can wind up with resources suspended if you miss the increasingly frantic emails. This gets closer to something else that we're going to talk about next week, namely, the underpants problem.



This has been another episode of the AWS Morning Brief: Whiteboard Confessional. I am Cloud Economist Corey Quinn, fixing your AWS bills and your AWS architectures at the same time. If you've enjoyed this podcast, please leave a five-star review on Apple Podcasts, whereas if you've hated it, please leave a five-star review on Apple Podcasts and tell me what single points of failure I have failed to consider.



Thank you for joining us on Whiteboard Confessional. If you have terrifying ideas, please reach out to me on Twitter at @QuinnyPig and let me know what I should talk about next time.



Announcer: This has been a HumblePod production. Stay humble.

Fri, 31 Jul 2020 03:00:00 -0700
AWS re:Lease The Kraken
AWS Morning Brief for the week of July 27, 2020.
Mon, 27 Jul 2020 03:00:00 -0700
Whiteboard Confessional: The Worst Thing You’ll See on Any Whiteboard

About Corey Quinn

Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.


Links


Transcript


Corey: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semi-polite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and why the best name for your staging environment is “theory.” Because invariably, whatever you’ve built works in theory, but not in production. Let’s get to it.



This episode is brought to you by Trend Micro Cloud One™. A security services platform for organizations building in the Cloud. I know you're thinking that that's a mouthful because it is, but what's easier to say? “I'm glad we have Trend Micro Cloud One™, a security services platform for organizations building in the Cloud,” or, “Hey, bad news. It's going to be a few more weeks. I kind of forgot about that security thing.” I thought so. Trend Micro Cloud One™ is an automated, flexible, all-in-one solution that protects your workloads and containers with cloud-native security. Identify and resolve security issues earlier in the pipeline, and access your cloud environments sooner, with full visibility, so you can get back to what you do best, which is generally building great applications. Discover Trend Micro Cloud One™, a security services platform for organizations building in the Cloud. Whew. At trendmicro.com/screaming.



Welcome to the AWS Morning Brief: Whiteboard Confessional. I am Cloud Economist Corey Quinn, which means that I fix the horrifying AWS bill both by making it more understandable and by making it lower. On today's episode of the AWS Morning Brief: Whiteboard Confessional: I've been looking at the whiteboards in the backgrounds of the Zoom calls that I'm having with basically everyone these days, because going to the office during a pandemic is and remains a deadly risk, and it's amazing how much you can learn about people's companies by what they leave on the whiteboards, whether you happen to be visiting their office or it's left inattentively in frame because they forgot to turn on a Zoom background.



One of the most disturbing things that you'll see on any whiteboard at any company you work at, ever, is an org chart. And what makes it disturbing, first off, is that when you see an org chart, it generally means someone is considering reorganizing, which is a polite framing of shuffling the deck chairs on the Titanic. It ties into one of the great corporate delusions: that somehow you're going to start immediately making good decisions, and all of your previous poor decision-making is going to fall away like dew in the new morning. And the reason that's the case is that everyone tends to be an optimist when looking forward, because otherwise we'd wake up crying and never go to work.



Have you ever noticed that you can take a look at an org chart or an architecture diagram, remove all of the labels, and find you've accidentally built the exact same thing, just with different services rather than teams? Well, I'm certainly not the first person to make this observation. What I'm talking about is known as Conway's Law, named after computer programmer Melvin Conway, who in 1967 introduced a phenomenal idea, for the time, that we still haven’t escaped from: any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure. Effectively, what that means is that you ship your culture as well as your org chart, and if we take a look at how different seminal software products of the ages have come out, it's pretty clear that there is at least some passing resemblance to reality.



Take a look at Amazon; they're effectively an entire microservices company. They have so many different small two-pizza teams building things, and sure enough, you take a look at AWS, for example, and they have 200-some-odd services that are ideally production-grade, but it's a mixed bag, because not every team is identical, and not every team has the same resources. That, as a result, is the good part of their culture showing through. Well, what's bad? Anything that requires all of those teams to coordinate at once on something. Think of the billing system. Think of the AWS web console. You start to see where these things break down. These are the seams between services that AWS tends to miss.



If you take a look at Google, for example, the entire model there, to my understanding, is that you want to get promoted and you want to get a raise, and that all comes down to certain metrics that don't necessarily align with what people want to be working on. So, you see people instead focusing on the things they're incentivized to do to move up in the org, rather than maintaining the things they built last year, which is why I suspect, at least, that we see this neverending wave of Google product deprecations. And the list goes on. I'm certainly not a corporate taxonomist; I'm a cloud economist, so I'm not going to go into too much depth on what that looks like in different places, but it does become telling. Let's get into that a bit more. But first:



This episode is sponsored in part by ChaosSearch. Now their name isn’t in all caps, so they’re definitely worth talking to. What is ChaosSearch? A scalable log analysis service that lets you add new workloads in minutes, not days or weeks. Click. Boom. Done. ChaosSearch is for you if you’re trying to get a handle on processing multiple terabytes, or more, of log and event data per day, at a disruptive price. One more thing, for those of you that have been down this path of disappointment before, ChaosSearch is a fully managed solution that isn’t playing marketing games when they say “fully managed.” The data lives within your S3 buckets, and that’s really all you have to care about. No managing of servers, but also no data movement. Check them out at chaossearch.io and tell them Corey sent you. Watch for the wince when you say my name. That’s chaossearch.io.



Now, one thing that I sort of pride myself on being—because I have to be—is a data center archaeologist. Frankly, these days it’s cloud archaeology. When I go into a new client environment, I ask them to show me their architecture diagrams, and that always goes the same way. First, people apologize because the architecture diagram is out of date. Spoiler: everyone's architecture diagram is out of date. That is the nature of the universe. Smile, nod, and accept it.



But it shows a lot about how they think about things; about how different components communicate with each other. And it really tends to lead to an interesting analysis in its own right, because what you're really looking at is effectively a second-order effect of their company culture. This is why digital transformations are so freaking tricky. You can call them cloud migrations, you can call them digital transformations, you can call them anything, as long as companies will sign the $20 million consulting project SOW.



The reason that they're tricky is that you're trying to change the culture of the company. It's my belief, and I know that there's an entire industry that's going to argue with me on this point, that you cannot change the culture of a company externally. It has to be something that you do organically, and it has to come from the top. It's something I've learned starting my own company: when it was just me, the culture was a one-to-one match with my personality. Now we're 10 people, and it absolutely has changed. “We're not going to have a formal organizational structure; everyone's just going to do the right thing.” Yeah, that breaks down right around the time there’s a second person involved.



Coordination among organizations becomes a serious challenge in its own right. There's always going to be some efficiency loss due to communication friction as companies get larger and larger. At some point, this enters the many-millions-of-dollars space, and companies sort of take on their own inertia. Earlier this week, IBM announced their earnings, and their stock went up massively in after-hours trading because they didn't lose quite as much money as analysts thought they were going to. I'm being slightly sarcastic; they did make a profit, but it was declining quarter-over-quarter and year-over-year.



And this is the problem: that inertia. One of my jokes about IBM has been that they're the kind of company that could go out of business, and five years later, some of their divisions might hear about it for the first time. Now, if you're big enough to have an environment where a digital transformation or a cloud migration is tricky—and spoiler: unless you're a one-person startup, you're going to have some problems there—you're going to encounter an environment where shifting the culture is required in order to effect meaningful and lasting change in how your product or service gets delivered and built. Cultural change is hard because you're asking people to do things differently, and we are all creatures of habit.



It's one thing to say, “Okay, we're going to use a different tool that everyone has to use.” You can get people to go along with that, grumbling. But you're going to communicate with one another differently? We're going to change our processes? Then there's carrying those legacy processes from the pre-digital transformation era—we'll call it the analog years—forward, into a scenario where a human being has to approve every step of the build-release pipeline. Yeah, it turns out that it's hard to find someone to work in your environment whose name is Jenkins, so at some point, you have to be willing to let go.



Some people also identify the thing that they do as core and central to their entire identity. The thing that they do defines their own self-perception. And when someone comes in with a new way of doing things, meaning the thing you used to do all the time can now be automated away, or its scope dramatically reduced, what you're fundamentally trying to sell that person on is their own irrelevance. At least, that's the perception. How do you manage that?



Well, that's where the art of digital transformation comes in. I don't have any magic answers on this because I don't sell digital transformations. I sell cloud cost optimization and understanding. Now, this stuff is linked in many cases, but it's certainly not a one-to-one match. And success in shifting the culture is always going to be dependent, on some level, on who is buying into the vision, who has to shift and how, and ultimately the outcome hereafter. Entirely too often, the reason people go through digital transformations is that they hired a new CIO, and that person needs to have something to put on their resume when they get fired in 18 months and look for their next CIO job. At that point, I think I've gotten cynical enough that it's time to call this an episode.



I’m Cloud Economist Corey Quinn. This is the AWS Morning Brief: Whiteboard Confessional. And if you've liked this podcast, please leave a five-star review on Apple Podcasts, whereas if you've hated this podcast, please leave a five-star review on Apple Podcasts along with a copy of your org chart slash microservices architecture diagram.



Thank you for joining us on Whiteboard Confessional. If you have terrifying ideas, please reach out to me on Twitter at @QuinnyPig and let me know what I should talk about next time.



Announcer: This has been a HumblePod production. Stay humble.

Fri, 24 Jul 2020 03:00:00 -0700
AI/ML Marketing Algorithm Continues to Malfunction
AWS Morning Brief for the week of July 20, 2020.
Mon, 20 Jul 2020 03:00:00 -0700
Whiteboard Confessional: The Right and Wrong Way to Interview Engineers

About Corey Quinn

Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Links



Transcript


Corey: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semi-polite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and why the best name for your staging environment is “theory.” Because invariably, whatever you’ve built works in theory, but not in production. Let’s get to it.



Sponsorships can be a lot of fun sometimes. ParkMyCloud asked, “Can we have one of our execs do a video webinar with you?” My response was, “Here’s a better idea. How about I talk to one of your customers instead, so you can pay me to make fun of you.” And it turns out I’m super-convincing. So, that’s what’s happening. Join me and ParkMyCloud’s customer, Workfront, on July 23rd for a no-holds-barred discussion about how they’re optimizing AWS costs, and whatever other fights I manage to pick before ParkMyCloud realizes what’s going on and kills the feed. Visit parkmycloud.com/snark to register. That’s parkmycloud.com/snark.



Welcome. I am Cloud Economist Corey Quinn, and this is the AWS Morning Brief: Whiteboard Confessional; things that we see on whiteboards that we wish we could unsee. Today I want to talk about the worst whiteboard confessions of all time, and those invariably all tend to circle around what we ask candidates to do on a whiteboard during job interviews. There are a whole bunch of objections, problems, and other varieties of crappy opinions around whiteboarding as part of engineering job interviews, but they're all a part of the larger problem, which is that interviewing for engineering jobs fundamentally sucks. There are enough Medium articles on how trendy startups have cracked the interview to fill an S3 bucket. So, I'm going to take the contrarian position that all of these startups and all of these people who claim to have solved the problem, suck at it.



And these terrible questions fall into a few common failure modes, most of which I saw when they were leveled at me back in my engineering days, when I was exercising my core competency of getting rapidly ejected from other companies. So, I spent a lot of time doing job interviews, and I kept seeing some of the same things appear. And they're all, of course, different. But let’s start with some of the patterns. The most obnoxious one by far is the open-ended question of how you would solve a given problem. And as you start answering the question, they're paying more attention than you would expect. Maybe someone's on their laptop, quote-unquote, ‘taking notes,’ an awful lot. And I can't ever prove it, but it feels an awful lot—based upon the question—like this is the kind of problem where you could suddenly walk out of the interview room, walk into the conference room next door, and find a bunch of engineers currently in a war room trying to solve the question you were just asked.



And what I hate about this pattern is it's a way of weaseling free work from interview candidates. If you want people to work on real-world problems, pay them. One of the best interviews I ever had was with a company that no longer exists called Three Rings Design. They wanted me to stand up a three-tiered web app, turn on monitoring, and do standard basic DevOps-style stuff; they handed me a set of AWS account credentials. But what made them stand out is that they said, “Well, this is a toy problem. We're not going to use this in production, surprise. It's a generic question. But we don't believe in asking people to do work for free, so when you submit your results, we'll pay you a few hundred bucks.” And this was a standard policy they had for everyone who made it to that part of the interview. It was phenomenal, and I loved that approach. It solved that problem completely. But it's the only time I've ever seen it in my entire career.



A variant of this horrible technique is to introduce the same type of problem, but with the proviso that this is a production problem that we had a few months ago. It's gone now, but how would you solve it? Now, on its face, this sounds like a remarkably decent interview question. It's real-world. They've already solved it. So, whatever answer you give is not likely to be something revolutionary that's going to change how they approach things. So, what's wrong with it? Well, the problem is that, in most cases, the right answer is going to look suspiciously like whatever they did to solve the problem when it manifested.



I answered a question like this once with, “Well, what would strace tell me?” And the response was, “What does strace do?” I explained that it attaches to processes and looks at the system calls that a process is making, and their response was, “Oh, yeah, that would have caught the problem. Huh. Okay, pretend strace doesn't exist.” Now it's not just a question of how you would solve the problem, but how you would solve the problem while being limited to their particular, often myopic, view of how systems work and how infrastructure needs to play out. This manifests itself the same way, by the way, in the world of various programming languages and traditional developer engineering roles. It's awful because it winds up forcing you to see things through a perspective that you may not share. Part of the value of having you go and work somewhere is to bring your unique perspective. And, surprise, there are all these books on how to pass the technical interview. There are many fewer books on how to freaking give one that doesn't suck. I wish that someone would write that book and that everyone else would read it. You can tell an awful lot about a company by how they go about hiring their people.



Corey: This episode is sponsored in part by ChaosSearch. Now their name isn’t in all caps, so they’re definitely worth talking to. What is ChaosSearch? A scalable log analysis service that lets you add new workloads in minutes, not days or weeks. Click. Boom. Done. ChaosSearch is for you if you’re trying to get a handle on processing multiple terabytes, or more, of log and event data per day, at a disruptive price. One more thing, for those of you that have been down this path of disappointment before, ChaosSearch is a fully managed solution that isn’t playing marketing games when they say “fully managed.” The data lives within your S3 buckets, and that’s really all you have to care about. No managing of servers, but also no data movement. Check them out at chaossearch.io and tell them Corey sent you. Watch for the wince when you say my name. That’s chaossearch.io.



Another very common failure mode for job interviews is when they ask a bunch of trivia questions. “What flag to this command does the following thing?” And what sucks there is it's a question of, “Have you seen this particular thing? And did it stick with you well enough to be able to more or less recite the manual back?” I don't know about you, but when I'm hiring, I believe in trying to find people who can add to the team, and if the best thing that someone can add is their ability to memorize and spit back trivia from man pages, I’m not that interested.



I would much rather see what else they can bring. You either know it or you don't, and that doesn't tell me much other than whether you've encountered this one thing in the wild. Do enough job interviews and, at some point, you'll be able to pass any interview because you're able to answer all of their toy problems appropriately. One of my biggest pet peeves is when the job interview questions, or the way you're expected to answer them, bear almost no relation to how things work in real life.



The Google phone screen for development roles is notorious for this. They make you write code in a blank Google Doc—pseudocode in most cases—but no one in the real world actually does this, in the same way that when you're trying to solve a problem, you're not writing code on a whiteboard in front of your coworkers unless you work on the third layer of hell. That's not how we think about these things. In one tab, I have a search engine open; in another, I'm frantically copying and pasting from Stack Overflow; and in a third, I might be asking folks in a Slack channel, or an IRC network, or even on Twitter how to solve particular issues that I'm running into. That's how real-world development works. It's not staring at a Google Doc, of all things, and it's not solving these, I guess, ridiculous algorithm questions that rehash already-solved problems.



So, that covers a lot of things that are terrible. What do I like to ask? Well, a common type of question that I've asked over the years, and that I love, is the open-ended question where people get the chance to talk about things that they're good at. Very often, interviews turn into a game of ‘let me find out what you're worst at and then needle you about it.’ It turns, effectively, into hiring people based on an absence of weakness rather than for particular strengths. And when you focus on that part of the story, it very quickly becomes this awful dynamic of having to set up discussions of, “Oh, tell me what you're worst at, and then I can beat the crap out of you on it for 45 minutes.” You can tell so much about a company by how they hire their people. Is that the kind of coworker dynamic you want to engage with? Not in my case. I've walked out of interviews over things like that, because it doesn't go well. Yes, there's privilege inherent in being able to do that, but they're never nicer to you than when they're trying to get you to work there, and it's presumably all downhill from there.



So, one question I love to ask is: have you heard of TinyURL? That actually does have a yes-or-no answer. And if the answer is no, you can explain what TinyURL does in about 30 seconds. It takes a long URL, converts it to a much shorter URL, and stores the mapping in a database. When someone visits the short URL, it returns a redirect to the longer URL. The end. You are now up to speed.
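
If it helps to make the toy concrete, here's roughly the whole service as a sketch in Python, using only the standard library; it deliberately skips slug creation, validation, and everything else the interview discussion would then layer on:

import sqlite3
from http.server import BaseHTTPRequestHandler, HTTPServer

db = sqlite3.connect("urls.db", check_same_thread=False)
db.execute("CREATE TABLE IF NOT EXISTS urls (slug TEXT PRIMARY KEY, target TEXT)")

class Redirector(BaseHTTPRequestHandler):
    def do_GET(self):
        # Look up the short slug; the entire product is this one query.
        row = db.execute(
            "SELECT target FROM urls WHERE slug = ?", (self.path.lstrip("/"),)
        ).fetchone()
        if row:
            self.send_response(301)  # permanent redirect to the long URL
            self.send_header("Location", row[0])
        else:
            self.send_response(404)
        self.end_headers()

HTTPServer(("", 8080), Redirector).serve_forever()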



And the question I like to give is, pretend that I know nothing about technology other than my ability to write large checks. This is known as being a venture capitalist. Now, I am writing you a large check to build a TinyURL equivalent service; something that will take in a short URL and return a redirect to a long URL. Go. What do I need? And it's completely open-ended, and what's fun about this is it's a small toy problem that encapsulates a whole bunch of different things. And you can go into database design, you can go into networking, you can go into how the system should be structured. You can talk about how you’d build this in Terraform, or CloudFormation or God forbid, something like Puppet or CFEngine. You can cover a lot of ground. You can even talk about the algorithmic complexity of some of the search queries and turn it into a whole interesting series of questions.



But you'll notice that people will generally start moving in a direction when you give them a question like this, that leads directly to the thing that they're best at. Networking people will talk about the networks; systems people will talk about the systems, and it's great. It gives people an opportunity to talk about the things that they are best at, and it lets them shine, which is really what I'm after in the context of a job interview.



Then you can make the question more complex: “Okay, so it works for six months. This thing you build is awesome. Now it's slow. Where in the system is it going to be slow?” Well, people will be able to identify bottlenecks, ideally, and then you can start throwing in other things like I want it to be in multiple regions, and I want it to be able to withstand a regional outage, and how will your system scale to that point?



And you can push back with constraints like this. It can become somewhat frustrating at times if you're not careful, so you want to make sure you frame it appropriately, but you also get to see how people have constructive discussions about constraints. If they get angry and start yelling at you, well, this is when they're supposed to be on their best behavior. In practice, having them go back and forth with you about different aspects is great. You want to get the questioning to a point where one of you is saying “I don't know” in response to something the other person says, because if you don't hit that depth, you're generally not getting to a point where you see the limits of someone's knowledge and how they react when they reach them.



Now, this is not a one size fits all interview question, but you can get an entire interview out of it if you want to. I'm not saying it's a silver bullet; I'm just saying that it's how I’ve liked to approach these things in the past for systems interviews. And it's certainly better than asking people to work for free or to do horrifying things that speak to terrible things about your culture.



This has been the AWS Morning Brief: Whiteboard Confessional ranting about interviews. I am Cloud Economist Corey Quinn, and this is the AWS Morning Brief. If you've enjoyed this podcast, please leave a five-star review on Apple Podcasts, whereas if you hated this podcast, please leave a five-star review on Apple Podcasts anyway, and implement QuickSort in the comments.



Thank you for joining us on Whiteboard Confessional. If you have terrifying ideas, please reach out to me on Twitter at @QuinnyPig and let me know what I should talk about next time.



Announcer: This has been a HumblePod production. Stay humble.

Fri, 17 Jul 2020 03:00:00 -0700
AWS Machine Learning Your Business From Inside
AWS Morning Brief for the week of July 13, 2020.
Mon, 13 Jul 2020 03:00:00 -0700
Whiteboard Confessional: The Curious Case of the 9,000% AWS Bill Increase

About Corey Quinn

Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Links


Transcript

Corey: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semi-polite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and why the best name for your staging environment is “theory.” Because invariably, whatever you’ve built works in theory, but not in production. Let’s get to it.



This episode is sponsored in part by ParkMyCloud, fellow worshipers at the altar of turn that [BLEEP] off. ParkMyCloud makes it easy for you to ensure you're using public cloud like the utility it's meant to be. Just like water and electricity, you pay for most cloud resources when they're turned on, whether or not you're using them. Just like water and electricity, keep them away from the other computers. Use ParkMyCloud to automatically identify and eliminate wasted cloud spend from idle, oversized, and unnecessary resources. It's easy to use, and you can start reducing your cloud bills today. Get started for free at parkmycloud.com/screaming.



When you're building on a given cloud provider, you're always going to have concerns. If you're building on top of Azure, for example, you're worried your licenses might lapse. If you're building on top of GCP, you're terrified that they're going to deprecate all of GCP before you get your application out the door. If you're building on Oracle Cloud, you're terrified they'll figure out where you live and send a squadron of attorneys to sue you just on general principle. And if you build on AWS, you're constantly living in fear, at least in a personal account, that they're going to surprise you with a bill that's monstrous.



Today, I want to talk about a particular failure that a friend of this podcast named Chris Short experienced. Chris is not exactly a rank neophyte to the world of Cloud. He currently works at IBM Hat, which I'm told is the post-merger name. He was deep in the Ansible community. He's a Cloud Native Computing Foundation Ambassador, which means that every third word out of his mouth is now contractually obligated to be Kubernetes.



He was building out a static website hosting environment in his AWS account, and it was costing him between $10 and $30 a month. That's right in line with what mine tends to cost. And he wound up getting his bill at the end of the month: “Welcome to July, time to get your bill,” and it was a bit higher. Instead of $30, or even $40, it was $2700. And now there was actual poop found in his pants.



This is a trivial amount of money to most companies; even a small company, and I say this from personal experience, runs on burning piles of money. However, a personal account is a very different thing. This is more than most people's mortgage payments, if you don't make terrible decisions like I do and live in San Francisco. This is an awful lot of money, and his immediate response was equivalent to mine. First, he opened a ticket with AWS support, which is an okay thing to do. Then he immediately turned to Twitter, which is the better thing to do, because it means that suddenly these stories wind up in the public eye.



I found out roughly 10 seconds later, as my notifications blew up with everyone saying, “Hey, have you met Corey?” Yes, Chris and I know each other. We're friends. He wrote the DevOps’ish newsletter for a long time, and the secret cabal of DevOps-y type newsletters runs deep. We secretly run all kinds of things that aren't the billing system for cloud providers.



So, he hits the batphone. I log into his account once we get a credential exchange going, and I start poking around because, yeah, generally speaking, a 100x bill increase isn't typical. And what I found was astonishing. He was effectively only running a static site with S3 in this account, making the contents publicly available, which is normal. This is a stated use case for S3, despite the fact that the console is going to shriek its damn fool head off at you at every opportunity that you have exposed an S3 bucket to the world.



Well, yes, that is one of its purposes. It is designed to stand there, or sit there depending on what a bucket does—lie there, perhaps—and provide a static website to the world. Now, in a two-day span, someone or something downloaded data from this bucket, which is normal, but it was 30 terabytes of data, which is not. At approximately nine cents a gigabyte, once free tier limits are exhausted, this adds up to something rather substantial; that's right: $2700.



Now, the typical responses about what people should do to avoid bill shocks like this don't actually work. “Well, he should have set up a billing alarm.” Yeah, aspirationally, the AWS billing system runs on an eight-hour eventual consistency model, which means that from the time the bill starts spiking, it takes at least 8 hours, and in some cases as many as 24 to 48, before those billing alarms would detect anything. The entire problem took less time than that.
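
For the record, the alarm everyone tells you to set looks something like this—a boto3 sketch with a hypothetical threshold and SNS topic ARN—and remember that billing metrics only live in us-east-1 and trail reality by hours, which is the whole complaint:

import boto3

# Billing metrics only exist in us-east-1, and only after you enable
# "Receive Billing Alerts" in the account's billing preferences.
cw = boto3.client("cloudwatch", region_name="us-east-1")

cw.put_metric_alarm(
    AlarmName="estimated-charges-spike",
    Namespace="AWS/Billing",
    MetricName="EstimatedCharges",
    Dimensions=[{"Name": "Currency", "Value": "USD"}],
    Statistic="Maximum",
    Period=21600,  # six hours; the metric itself updates slowly anyway
    EvaluationPeriods=1,
    Threshold=50.0,  # hypothetical monthly pain threshold, in dollars
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:billing-alerts"],  # hypothetical topic
)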



So, at that point, it would be alerting after something had already happened. “Oh, he shouldn't have had the bucket available to the outside world.” Well, as it turns out, he was fronting this bucket with CloudFlare. But what he hadn't done is restrict bucket access to CloudFlare’s endpoints, and for good reason. There's no way to say, “Oh, CloudFlare’s identity is going to be defined in an IAM managed policy.” He would have to explicitly list out all of CloudFlare’s IP ranges, and hope and trust that those ranges never change despite whatever networking enhancements CloudFlare makes. It's a game of guess-and-check, and of having to build an automated system around it. Again, all he wanted to do was share a static website. I've done this myself. I continue to do this myself, and it costs me, on a busy month, pennies. In some rare cases, dozens of pennies.
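
For completeness, the bucket policy in question looks roughly like this—a boto3 sketch with a hypothetical bucket name and only an excerpt of CloudFlare's published ranges, which is exactly the maintenance burden being described:

import json
import boto3

s3 = boto3.client("s3")

# Excerpt only; the authoritative list is published at
# https://www.cloudflare.com/ips/ and has to be re-synced (by hand or
# by automation) whenever it changes.
CLOUDFLARE_RANGES = ["173.245.48.0/20", "103.21.244.0/22"]

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "AllowCloudFlareOnly",
        "Effect": "Allow",
        "Principal": "*",
        "Action": "s3:GetObject",
        "Resource": "arn:aws:s3:::example-static-site/*",  # hypothetical bucket
        "Condition": {"IpAddress": {"aws:SourceIp": CLOUDFLARE_RANGES}},
    }],
}

s3.put_bucket_policy(Bucket="example-static-site", Policy=json.dumps(policy))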



Corey: This episode is sponsored in part by ChaosSearch. Now their name isn’t in all caps, so they’re definitely worth talking to. What is ChaosSearch? A scalable log analysis service that lets you add new workloads in minutes, not days or weeks. Click. Boom. Done. ChaosSearch is for you if you’re trying to get a handle on processing multiple terabytes, or more, of log and event data per day, at a disruptive price. One more thing, for those of you that have been down this path of disappointment before, ChaosSearch is a fully managed solution that isn’t playing marketing games when they say “fully managed.” The data lives within your S3 buckets, and that’s really all you have to care about. No managing of servers, but also no data movement. Check them out at chaossearch.io and tell them Corey sent you. Watch for the wince when you say my name. That’s chaossearch.io.



So, we start looking at this, and it becomes abundantly clear that there is a serious problem here. It all came out of that bucket; there was nothing else in use in this environment that could have caused this. The first 10 terabytes were charged at nine cents a gigabyte, then it dropped to eight and a half cents per gigabyte for the next 40 terabytes, which it never exceeded; it landed at roughly 30 and a half terabytes. Now, what caused this? Nobody knows, because logging wasn't enabled on this bucket, and you can't do that retroactively, so we have no idea what objects were involved or which requester IPs were involved.
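
Turning that logging on ahead of time is a one-time call. A sketch with hypothetical bucket names; note the target bucket also needs to grant the S3 log delivery service write access:

import boto3

s3 = boto3.client("s3")

# Server access logs land in a second bucket. They won't reconstruct
# yesterday's traffic, but the next incident will at least leave a trail
# of which objects were requested and by which requester IPs.
s3.put_bucket_logging(
    Bucket="example-static-site",  # hypothetical bucket names throughout
    BucketLoggingStatus={
        "LoggingEnabled": {
            "TargetBucket": "example-static-site-logs",
            "TargetPrefix": "access/",
        }
    },
)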



So, anything you tell Chris that he should have done is telling him after it's too late to do anything about it. Again, he is not some random fool who fails to understand how computers work. He's a Cloud Native Computing Foundation Ambassador, he works at IBM Hat, and he is very well known in this space. A coulda, woulda, shoulda response here does not work, because it comes down to the fact that, yeah, this is actual personal money. He tweeted a thank-you to his wife for not, effectively, having a mild heart attack in response to this.



Now, at the time of this recording, I have it on good authority that AWS is going to make this right in his account. But he's depending upon the largesse of a company to look at this and say, “Oh, that's okay.” There's no guarantee that that's going to happen. That's a concern that he should not have to deal with here. There's a world of difference between a person on a personal account playing around with AWS, and what an enterprise would expect to happen.



I've been agitating for a while for a better approach to the free tier, and this highlights exactly why that is. I want a model where rather than charging me through the nose, you stop serving traffic. A bunch of toy experiments I have in place would benefit greatly from this particular model because I don't want to wind up having to cut a mortgage check to AWS, I just want it to stop serving my ridiculous thing that accidentally got stuck in a loop or got pushed to the wrong place.



Again, I want to be explicitly clear here. Setting up this static site, Chris did nothing wrong by any measure. This was not a misconfigured S3 bucket. This was not passing data back and forth 50 times because of bad architecture. He is using the system as it was intended to be used, and as it was designed to be operated, and then a surprise bill hit him out of nowhere.



Now, what's fascinating to me about this is that things like this happen to my clients all the time, and they really don't care all that much. When you're spending a couple million dollars a month, you don't really care about a $3,000 bill surprise. In fact, it's hard to find it in the noise of everything else going on in the account. So, things like this aren't on larger companies' radars as a risk.



There are ways to wind up catching this from the outside, though everything I can think of is a little nutty. You could enable advanced monitoring on the S3 buckets; you can see the bytes downloaded on a per-bucket basis and then set up deviation monitoring. But that's going to be super noisy, because it's going to go off whenever usage spikes significantly above where it was before. That's an awful lot of problem, and there's still no good solution other than looking at Chris and clucking at him, telling him he did it wrong.
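
If you want to build the nutty version anyway, it's two calls—a sketch with hypothetical names and a hypothetical threshold—using S3 request metrics (which are off by default and cost extra) plus a CloudWatch alarm on bytes downloaded:

import boto3

s3 = boto3.client("s3")
cw = boto3.client("cloudwatch")

# Enable request metrics for the whole bucket under a filter ID we choose.
s3.put_bucket_metrics_configuration(
    Bucket="example-static-site",  # hypothetical bucket
    Id="EntireBucket",
    MetricsConfiguration={"Id": "EntireBucket"},
)

# Alarm when more than roughly 50 GB is downloaded in a single hour.
cw.put_metric_alarm(
    AlarmName="s3-egress-spike",
    Namespace="AWS/S3",
    MetricName="BytesDownloaded",
    Dimensions=[
        {"Name": "BucketName", "Value": "example-static-site"},
        {"Name": "FilterId", "Value": "EntireBucket"},
    ],
    Statistic="Sum",
    Period=3600,
    EvaluationPeriods=1,
    Threshold=50 * 1024**3,  # hypothetical hourly egress ceiling, in bytes
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:billing-alerts"],  # hypothetical
)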



So, when I went through and did this analysis, last month's bill was just shy of $2700. This month's bill so far—and I'm recording this on July 6th—was 28 cents. But in a final twist, just to make sure it hurts a little bit more, AWS now predicts the bill in his account for the month of July to be just over $1100. This is one of those customer profiles that everyone starts out at, and it's a profile that gets left behind by the current painful process and horrifying nightmare that is the AWS billing system.



I'm Cloud Economist Corey Quinn. This is the AWS Morning Brief: Whiteboard Confessional series. And if you've enjoyed this podcast, please leave a five-star review on Apple Podcasts, whereas if you've hated this podcast, please leave a five-star review on Apple Podcasts and leave an S3 bucket open for me.



Thank you for joining us on Whiteboard Confessional. If you have terrifying ideas, please reach out to me on Twitter at @QuinnyPig and let me know what I should talk about next time.



Announcer: This has been a HumblePod production. Stay humble.

Fri, 10 Jul 2020 03:00:00 -0700
Kicking AWS's ASS into Space
AWS Morning Brief for the week of July 7, 2020.
Mon, 06 Jul 2020 03:00:00 -0700
Whiteboard Confessional: The Day IBM Cloud Dissipated

About Corey Quinn

Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Links


Transcript


Corey: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semi-polite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and why the best name for your staging environment is “theory.” Because invariably, whatever you’ve built works in theory, but not in production. Let’s get to it.



This episode is sponsored in part by ParkMyCloud, fellow worshipers at the altar of turn that [BLEEP] off. ParkMyCloud makes it easy for you to ensure you're using public cloud like the utility it's meant to be. Just like water and electricity, you pay for most cloud resources when they're turned on, whether or not you're using them. Just like water and electricity, keep them away from the other computers. Use ParkMyCloud to automatically identify and eliminate wasted cloud spend from idle, oversized, and unnecessary resources. It's easy to use, and you can start reducing your cloud bills today. Get started for free at parkmycloud.com/screaming.



Welcome to the AWS Morning Brief’s Whiteboard Confessional series. I am Cloud Economist Corey Quinn, and today's topic is going to be slightly challenging to talk about. One of the core tenets that we've always had around technology companies and working with SRE or operations-type organizations is, full stop, you do not make fun of other people's downtime, because today it's their downtime, and tomorrow it's yours. It's important. That's why we see the hashtag #HugOps on Twitter start to—well, not trend; it's not that well known, but it definitely shows up fairly frequently when there's a well-publicized multi-hour outage that affects a company people are familiar with.



So, what we're going to talk about is an outage that happened several weeks ago for IBM Cloud. I want to point out some failings on IBM’s part but this is in the quote-unquote, “Sober light of day.” They are not currently experiencing an outage. They've had ample time to make public statements about the cause of the outage. And I've had time to reflect a little bit on what message I want to carry forward, given that there are definitely lessons for the rest of us to learn. HugOps is important, but it only goes so far, and at some point, it's important to talk about the failings of large companies and their associated response to crises so the rest of us can learn.



Now, I'm about to dunk on them fairly hard, but I stand by the position that I'm taking, and I hope that it's interpreted in the constructive spirit that I intend. For background, IBM Cloud is IBM's purported hyperscale cloud offering. It was effectively stitched together from a variety of different acquisitions, most notable among them SoftLayer. I've had multiple consulting clients who have been customers of IBM Cloud over the past few years, and their experience has been, to put it politely, a mixed bag. In practice, the invective that they would lob at it was something worse.



Now, a month ago, something strange happened to IBM Cloud. Specifically, it went down. I don't mean that a service started having problems in a region. That tends to happen to every cloud provider, and it's important that we don't wind up beating them up unnecessarily for these things. No, IBM Cloud went down. And when I say that IBM Cloud went down, I mean, the entire thing effectively went off the internet. Their status page stopped working, for example. Every resource that people had inside of IBM Cloud was reportedly down. And this was relatively unheard of in the world of global cloud providers.



Azure and GCP don't have the same isolated per-region network boundary that AWS has, but even in those cases, we far more frequently tend to see rolling outages rather than global outages affecting everything simultaneously. It's a bit uncommon. What's strange is that their status page was down. Every point of access you had for looking at what was going on with IBM Cloud was down. Their Twitter accounts fell silent, other than pre-scheduled promotional tweets that were set to go out. It looked for all the world like IBM had just decided to pack up early, turn everything off on the way out of the office, and enjoy the night off.



That obviously isn't what happened, but it was notable in that there was no communication for the first hour or so of the outage, and this was causing people to go more than a little bonkers. One of the things that was interesting to me while this was happening, since it was impossible to get anything substantive or authoritative out of IBM, was pulling up their marketing site. Now, the marketing site still worked—apparently, it does not live on top of IBM Cloud—and it listed a lot of their marquee customers and case studies. I went through a quick sampling, and American Airlines was the only site that had a big outage notification on the front of it. Everything else seemed to be working.



So, either the outage was not as widespread as people thought, or a lot of their marquee customers are only using them for specific components. Either one of those is compelling and interesting, but we don't have a whole lot of data to feed back into the system to draw reasonable conclusions. Their status page itself, as mentioned, was down, and that's super bad. One of the early things you learn when running a large-scale system of any kind is that the thing that tells you—and the world—that you're down cannot have a dependency on any of the things that you are personally running. The AWS status page had this problem, somewhat hilariously, during the S3 outage a few years ago, when they had trouble updating what was going on due to that very outage. I would imagine that's no longer the case, but one does wonder.



And most damning, and the reason I bring this up: the following day, they posted the following analysis on their site: “IBM is focused on external network provider issues as the cause of the disruption of IBM Cloud services on Tuesday, June 9th. All services have been restored. A detailed root cause analysis is underway. An investigation shows an external network provider flooded the IBM Cloud network with incorrect routing, resulting in severe congestion of traffic, and impacting IBM Cloud services, and our data centers. Mitigation steps have been taken to prevent a recurrence. Root cause analysis has not identified any data loss or cybersecurity issues. End of message.”



Corey: This episode is sponsored in part by ChaosSearch. Now their name isn’t in all caps, so they’re definitely worth talking to. What is ChaosSearch? A scalable log analysis service that lets you add new workloads in minutes, not days or weeks. Click. Boom. Done. ChaosSearch is for you if you’re trying to get a handle on processing multiple terabytes, or more, of log and event data per day, at a disruptive price. One more thing, for those of you that have been down this path of disappointment before, ChaosSearch is a fully managed solution that isn’t playing marketing games when they say “fully managed.” The data lives within your S3 buckets, and that’s really all you have to care about. No managing of servers, but also no data movement. Check them out at chaossearch.io and tell them Corey sent you. Watch for the wince when you say my name. That’s chaossearch.io.



Now, my problem with that is it focuses on the idea of a single root cause, which most of the folks in the human-factors part of the internet will tell you is never a true statement. In fact, J. Paul Reed, a friend of the podcast, the newsletter, and, occasionally, me, will angrily shake his fist; he hates that almost as much as he does the five whys. But the point here is that if a single provider messing up their network announcements can take down 80 of your cloud data centers for hours, and you're unable to communicate with the outside world, yeah, that's obviously bad, but you have failed on a multitude of different levels at building a robust system that can withstand that kind of disruption.



Perhaps most damning of all, those customers that I mentioned earlier who have a presence on IBM Cloud were texting with their account managers, because the account managers had no access to any internal systems. Reportedly, the corporate VPN was not working. My thesis, therefore, is that given that everyone was remote, no one was on site, and everything was single-tracking through a corporate VPN that was itself subject to this disruption, there was no one able to log in and send a message authoritatively on behalf of IBM. All of their traditional tweets had been done through an enterprise social media client called Sprinklr, with no e in it, because social media. Enterprise. Ehh. And surprisingly, all of the developer advocates that I know of—I checked their feeds during this outage—were completely silent.



So, it was clear that no one was authorized to communicate about the outage, and silence when a customer is in pain is one of the worst things you can do. Explain to them that you're aware of the issue, that you're focusing on it, and that you will have updates for them on a cadence. That is what breeds trust. No one expects a system to never go down, but they do have significant expectations around what is going to be done in the wake of outages. So, the things I take away from this, if it were me: it's important to have ways into the network for specific folks that aren't tracked through the same things that are potentially going to go down in the event of a network disruption; you need to have a crisis communications plan for social media and other formats; when the corporate VPN is down, you can't bottleneck through it; and most importantly, you absolutely cannot blame arbitrary third-party misconfiguration mistakes—which are, let's face it, what the internet is built on top of—for a global multi-hour outage if you expect to be taken seriously in the world of cloud providers.



In the wake of this, barring further communication, I have no choice but to nominate IBM Cloud for the Oxymoron of the Year. I know it seems harsh, but there are so many missteps and failings here that it is apparent that IBM is not willing to have a good-faith public conversation about this, instead hoping to sweep it under the carpet and hope that no one brings it up ever again. That's not how we improve. We all make mistakes. We all take outages. AWS will periodically publish full-on analyses of what broke. Google has some of the best analyses in the world that I've ever seen when they take outages for various things. Microsoft has turned explaining outages to businesses into an art form, and they have 40 years of experience doing it. They are polished almost to an annoying degree. From IBM we've gotten only silence, stonewalling, and blaming others; and viewed through the lens of picking a responsible cloud provider, I seriously doubt IBM's capability to service the market at this time, barring further self-reflection and the results of that self-reflection communicated in public.



This has been the Whiteboard Confessional version of the AWS Morning Brief. I am Cloud Economist Corey Quinn. If you've enjoyed this podcast, please leave a five-star review on Apple Podcasts. Whereas if you've hated this podcast, you almost certainly worked for IBM, and are not allowed to use Apple products anyway.



Thank you for joining us on Whiteboard Confessional. If you have terrifying ideas, please reach out to me on Twitter at @quinnypig and let me know what I should talk about next time.


Announcer: This has been a HumblePod production. Stay humble.

Fri, 03 Jul 2020 03:00:00 -0700
Oh, Honey; Help the Cops in us-west-3
AWS Morning Brief for the week of June 29, 2020.
Mon, 29 Jun 2020 03:00:00 -0700
Whiteboard Confessional: Bespoke Password Management

About Corey Quinn


Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Links

Transcript

Corey: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semi-polite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best name to call your staging environment is “theory”, because invariably whatever you’ve built works in theory, but not in production. Let’s get to it.



This episode is sponsored in part by ParkMyCloud, fellow worshipers at the altar of turn that [bleep] off. ParkMyCloud makes it easy for you to ensure you're using public cloud like the utility it's meant to be. Just like water and electricity, you pay for most cloud resources when they're turned on, whether or not you're using them. Unlike water and electricity, keep them away from the other computers. Use ParkMyCloud to automatically identify and eliminate wasted cloud spend from idle, oversized, and unnecessary resources. It's easy to use and start reducing your cloud bills. Get started for free at parkmycloud.com/screaming.



In today's episode of the Whiteboard Confessional on the AWS Morning Brief, I want to talk to you about how I log into AWS accounts. Now, obviously, I've got a fair few of them here at The Duckbill Group, ranging from accounts that I use to test out new services, to the accounts that run the production side of my Last Week in AWS newsletter, to my legacy account, because of course I have a legacy account for a four-year-old company. This is the Cloud we're talking about. And, as of this writing, they currently add up to 17 accounts in our AWS organization.



Beyond that, there's a lot more we have to worry about. We assume restricted roles into client AWS accounts to conduct our cost analyses. Getting those set up has been a bit of a challenge historically. We have a way of doing it now that we've open-sourced in our company GitHub repo. Someday, someone will presumably discover this, and then I'll get to tell that story. Now, to add to all of this complex nonsense, let's not forget that back when I used to travel to other places, before the dark times we're currently living in, I used to do all of my work on the road from an iPad Pro.



So what was the way to intelligently manage logging into all of these different accounts and keep them straight? Now, using IAM username and password pairs is patently ridiculous. By the time you take in whatever accounts I'm currently working on, we've got, eh, 40 AWS accounts to care about, which would completely take over my password manager if I went down that path; further, it wouldn't solve the problem that most of the time I interact with these accounts only via API. Now, that's not entirely true because, as we've mentioned, the highest level of configuration management enlightenment is, of course, to use the console, and then lie about it.



Today, I want to talk about how I chained together several ridiculous things to achieve an outcome that works for basically all of these problems. There are almost certainly better ways to do this than what I do. I keep hearing rumors that AWS Single Sign-On can do all this stuff in a better way, but every time I attempt to use it, I get confused and angry and storm off to do something else. So here's what I do. First, I start with my baseline AWS account that has an actual IAM user with a permanent set of credentials in it. That's my starting point. Now, I store those credentials on my Mac in Keychain, and on my EC2 instance running Linux, they live within the pass utility, which uses GPG-based encryption to store a string securely.



Now, before I get angry letters—because oh, dear Lord, do I get them—let me just say that this is a requirement that instance roles with their ephemeral credentials won't suit. So using an instance role for that EC2 instance won't apply. Specifically, because there's no way today to apply MFA to instance roles, and some of the roles I need to assume do have MFA as a requirement, so that's a complete non-starter. And in these different environments, that effective root pair of credentials is managed by a tool that came out of 99designs called aws-vault. Don't confuse this with HashiCorp’s Vault, which is something else entirely. This started off as a favorite of mine, but given the periodic breaking changes that the aws-vault maintainers have introduced with different versions, it has become something far less treasured. They'll release a bunch of enhancements that up the version, which is great, but they haven't gotten around to updating the documentation well, so I have to stumble my way through it, and I'm angry every time I spin up something new, and then I give up and roll back to a version that works.



There are now other tools I'm looking at as an alternative to this, mostly because this behavior has really torqued me off. Now aws-vault, as well as many other tools in the ecosystem, can read your local configuration file in your .aws directory. It uses this for things like chaining roles together, so you can assume a role in an account that then is allowed to assume a role in a different account, and so on and so forth. It can tell you which credential set to use, which MFA device is going to be used to log into accounts, which region that account is going to be primarily based in, etcetera. It's surprisingly handy except for when it breaks with aws-vault releases in [unintelligible] what it's expecting to see in that file. I digress again. Sorry, just thinking about this stuff makes me mad, so I'm going to cool down for a second.
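


Before I cool down, for illustration only: a minimal sketch of what one of those role-chaining stanzas can look like in ~/.aws/config. The profile names, account IDs, and MFA serial here are hypothetical, not my actual setup.

    [profile base]
    region = us-west-2

    [profile client-readonly]
    # Assume this role using credentials from the base profile.
    source_profile = base
    role_arn = arn:aws:iam::111122223333:role/CostAnalysis
    # Prompt for a token from this MFA device when assuming the role.
    mfa_serial = arn:aws:iam::444455556666:mfa/corey
    region = us-east-1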



Corey: This episode is sponsored in part by ChaosSearch. Now their name isn’t in all caps, so they’re definitely worth talking to. What is ChaosSearch? A scalable log analysis service that lets you add new workloads in minutes, not days or weeks. Click. Boom. Done. ChaosSearch is for you if you’re trying to get a handle on processing multiple terabytes, or more, of log and event data per day, at a disruptive price. One more thing, for those of you that have been down this path of disappointment before, ChaosSearch is a fully managed solution that isn’t playing marketing games when they say “fully managed.” The data lives within your S3 buckets, and that’s really all you have to care about. No managing of servers, but also no data movement. Check them out at chaossearch.io and tell them Corey sent you. Watch for the wince when you say my name. That’s chaossearch.io.



Now I can interact with aws-vault in two ways. One spawns a shell that has all of the usual AWS environment variables you would expect it to have, with temporary credentials and session tokens. Great. Suddenly, every other tool on the planet does not need to be taught how to work with an assumed role. It just runs locally, sees those environment variables, and does the right thing.
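


In practice, that looks something like this—a sketch with a hypothetical profile name—where the wrapped command inherits AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN without ever touching the long-lived credentials:

    # First mode: run a command with temporary credentials injected.
    aws-vault exec client-readonly -- aws sts get-caller-identity

    # Second mode (described below): print a console sign-in URL.
    aws-vault login client-readonly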



Now, that's not quite as exciting to talk about. I’ve built something monstrous, and that's what I'm here to talk about today, which brings us to the second way that I interact with aws-vault: having it spit out a URL that logs you into the AWS console. Now, on a desktop, it automatically opens your browser and logs you in. Of course, this doesn't work super well on an iPad that's remoting into an EC2 instance, or from an EC2 instance at all, for that matter. It turns out that even with super small text on a high-resolution display, the login URL that this thing spits out is three lines long, since signed URLs in AWS land are apparently some of that experience they claim there's no compression algorithm for.



So this is where I took something that's already monstrous and made it worse. I built a shortcut that spits out a login link to generate that long signed link [unintelligible] above. Cool. Then I pipe that result to a script that I wrote. That script generates a UUID or Universally Unique Identifier. This is a 128-bit number. The odds of generating two that are the same are astronomically small. Specifically, if you generated a billion of these every second, it would take roughly 85 years to have even coin-flip odds of a single collision. Next, that UUID becomes the name of an S3 object. That object is set to redirect to the stupidly long URL that aws-vault has spit out. So I have a short link I can now click, but it turns into that long link through a redirect.
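


If you wanted to approximate that redirect trick yourself, a minimal Python sketch might look like the following. The bucket name and short domain are hypothetical, and it assumes the bucket has static website hosting enabled so that S3 will honor the redirect:

    import uuid

    import boto3  # assumes AWS credentials are already in the environment


    def shorten(signed_url: str, bucket: str = "example-redirect-bucket") -> str:
        """Store an empty S3 object whose only job is to redirect to signed_url."""
        key = str(uuid.uuid4())  # 128-bit random name; collisions are effectively impossible
        boto3.client("s3").put_object(
            Bucket=bucket,
            Key=key,
            Body=b"",
            WebsiteRedirectLocation=signed_url,  # S3 serves this object as a 301
        )
        # A CloudFront distribution in front of the bucket makes this the short link.
        return f"https://login.example.com/{key}"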



But wait; I'm not done. That URL is potentially dangerous. If anyone else sees it, they can log in as me. I've gotten around this in a few ways. The S3 bucket that serves the redirect is fronted by a CloudFront endpoint. SSL is required: at no point is the URL going to be communicated in cleartext. Further, I wrote a Lambda@Edge function that's attached to that CloudFront distribution. When it receives the request, it not only returns the redirect, it also deletes the S3 object that's referenced. So while this does sometimes mean that the link is cached in my local browser, I've disabled caching in CloudFront. So now when I do race tests between two computers, the first one will resolve the link and log me into the console; the second computer, hit at roughly the same time, does not. Lastly, should the Lambda function ever fail, there is a periodic reaping job that removes all the redirects from that bucket, taking care to bypass the index file, which exists solely to prevent people from seeing the redirect objects that are currently there.
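


For the curious, a Lambda@Edge handler implementing that burn-after-reading behavior might be shaped roughly like this Python sketch; the bucket name is hypothetical, and error handling is omitted:

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "example-redirect-bucket"  # hypothetical


    def handler(event, context):
        # Lambda@Edge delivers the CloudFront request in this envelope.
        request = event["Records"][0]["cf"]["request"]
        key = request["uri"].lstrip("/")
        # Read the redirect target off the object, then burn the link.
        target = s3.head_object(Bucket=BUCKET, Key=key)["WebsiteRedirectLocation"]
        s3.delete_object(Bucket=BUCKET, Key=key)
        # Hand CloudFront a generated 302 instead of forwarding to the origin.
        return {
            "status": "302",
            "statusDescription": "Found",
            "headers": {"location": [{"key": "Location", "value": target}]},
        }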



A few security friends of mine took a look at this and all came to the same conclusion: this is A) ridiculous, B) overbuilt, and C) it works. That beautiful trifecta of a combination made it a perfect topic to discuss in this week's episode of the AWS Morning Brief: Whiteboard Confessional.



I am Cloud Economist Corey Quinn. This is the AWS Morning Brief. And if you've enjoyed this podcast, please leave a five-star review on Apple Podcasts. Whereas if you've hated it and found it appalling, leave a five-star review on Apple Podcasts anyway, and tell me what I should be using instead.



Thank you for joining us on Whiteboard Confessional. If you have terrifying ideas, please reach out to me on Twitter at @quinnypig and let me know what I should talk about next time.


Announcer: This has been a HumblePod production. Stay humble.

Fri, 26 Jun 2020 03:00:00 -0700
111 Gigabytes Per Ounce
AWS Morning Brief for the week of June 22, 2020.
Mon, 22 Jun 2020 03:00:00 -0700
Whiteboard Confessional: Help, I’ve Lost My MFA Device!

About Corey Quinn

Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Links

Transcript

Corey: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semi-polite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best name to call your staging environment is “theory”, because invariably whatever you’ve built works in theory, but not in production. Let’s get to it.



This episode is sponsored by a personal favorite: Retool. Retool allows you to build fully functional tools for your business in hours, not days or weeks. No front end frameworks to figure out or access controls to manage; just ship the tools that will move your business forward fast. Okay, let's talk about what this really is. It's Visual Basic for interfaces. Say I needed a tool to, I don't know, assemble a whole bunch of links into a weekly sarcastic newsletter that I send to everyone. I can drag various components onto a canvas: buttons, checkboxes, tables, etc. Then I can wire all of those things up to queries with all kinds of different parameters: post, get, put, delete, etc. It all connects to virtually every database natively, or you can do what I did and build a whole crap ton of Lambda functions, shove them behind API Gateway, and use that instead. It speaks MySQL, Postgres, Dynamo—not Route 53, in a notable oversight; but nothing's perfect. Any given component then lets me tell it which query to run when I invoke it. Then it lets me wire up all of those disparate APIs into sensible interfaces. And I don't know frontend; that's the most important part here: Retool is transformational for those of us who aren't frontend types. It unlocks a capability I didn't have until I found this product. I honestly haven't been this enthusiastic about a tool for a long time. Sure, they're sponsoring this, but I'm also a customer and a super happy one at that. Learn more and try it for free at retool.com/lastweekinaws. That's retool.com/lastweekinaws, and tell them Corey sent you because they are about to be hearing way more from me.



Welcome to the AWS Morning Brief: Whiteboard Confessional. Today I want to talk about infosec. Specifically, an aspect of infosec that I think is not given proper attention, namely two-factor auth. Now, two-factor auth is important to enable, but first, back up a second. Use a password manager with strong passwords for all of your stuff. Those are table stakes at this point.



Now, most password managers will offer to also store your multi-factor auth codes, your OTP tokens, etcetera. I'm not a big fan of that because it feels to me, perhaps incorrectly, like I'm collapsing multiple factors back down into that same factor. Someone gets access to my password manager, worst-case scenario, I’m potentially hosed. That's not great. Now, the password managers will argue that this isn't technically true, yada, yada. I'm old fashioned. I'm grumpy. I'm an old Unix systems administrator who had certain angry, loud opinions, so I'm going to keep using separate tools for managing passwords and for handling second factors. May I also point out that SMS is terrible as far as second factors go. Don't use it if you can possibly avoid it, for reasons that go well beyond the scope of this show: we're not that kind of podcast.



Now, let's talk about what happens if you, for one reason or another, lose your MFA device, or the app on your phone, because this happened to a certain business partner of mine named Mike Julian. Now, Mike wound up getting a new phone, which is great because his was something from the Stone Age, presumably some kind of Nokia candy bar phone. I hear someone dropped one of those things once, back when they were last in mass sale, and accidentally killed the dinosaurs. So, that's the era of phone he was upgrading from to, I think, the iPhone SE, but don't quote me on that. I don't tend to pay attention to his taste in electronics. Personally, I question his taste in business partners, but that's all right; he signed on the dotted line; he's stuck with me now.



So, he inadvertently wound up losing access to all of his old MFA tokens and having to get them re-added in other places. Some systems worked super well for this. It was a matter of, “Oh, I'll just use my backup codes,” which he kept like a good, responsible person. They let him in, he was then able to regenerate backup codes and change over the device, and everything was glory. For others, he wasn't so lucky and had to phone in and get a reset after identity verification. So, now that he didn't have his multi-factor device, it would fall back to using SMS because it had his cell phone number. And he could not disable that in some environments. So, that becomes an attack vector, if you're able to compromise an SMS number which, surprise, is not that hard if you put some effort into it.



This, of course, does bring us to Amazon. Mike needed to reset his Amazon MFA token. Now, when I say Amazon, I don't mean AWS. I mean, Amazon, and I'm going to go back and forth as I go down the story a little bit. So, this is an Amazon retail account, not an AWS account. And it turns out when you Google how to reset your Amazon MFA token, all the results are about AWS.



So, “Okay, that's interesting,” says Mike. He Googles again, effectively excluding all results from aws.amazon.com. Cool. Now all the results are about things that are not Amazon stuff. Not anything helpful. So, there's no documentation in Google for any of this as it applies to Amazon retail; it may as well not exist as a problem. This is less than ideal from Mike's perspective. He was able to reset his AWS multi-factor auth for the AWS account—that's for the same email address tied to that amazon.com account—but AWS and Amazon have completely separate MFA infrastructures.



So, this is fascinating. He posts on Twitter, which is the number one way to get help when you have an Amazon issue and you run a company devoted to making fun of Amazon, and AWS support chimes in because they're helpful. Someone else says, “I've been trying to solve this problem for 10 years and got nowhere. Good luck, Godspeed.” And it seemed odd because it's an Amazon retail problem. Why is AWS chiming in? And this leads to a phone call. Mike finally winds up getting on a series of phone calls with AWS support.



Let me handwave past the boring part. More or less, no one knew what the story was here. It goes back and forth, and back and forth. It turns out that if you have an AWS account and an amazon.com account tied to the same email address, there's a little known secret that is kept secret not just from customers, but from support on both sides in many cases, where if you disable MFA for your amazon.com account, the AWS token for your linked AWS account now takes over on the amazon.com account, because everything is terrible, and broken, and awful.



And it was a living nightmare to sort that out. If someone is listening to this, ideally this may save you if you're still suffering from the Underpants Problem. For those who are not familiar, the Underpants Problem is when your $4 billion startup unicorn has all of its infrastructure running in an AWS account that is still linked to the amazon.com account that your founder uses to buy underpants. That's why it's called the Underpants Problem. This was not, however, the worst problem that Mike encountered while replacing his many, many MFA tokens. But first:



This episode is sponsored in part by N2WS. You know what you care about? Many things, but never backups. At least until right after you really, really, really needed to care about backups. That's what N2WS does for your AWS account. It allows you to cycle backups through different storage tiers; you can back things up cost-effectively, and safely. For a limited time, N2WS is offering you $100 in AWS credits for setting up their free trial, and I encourage you to give it a shot. To learn more, visit snark.cloud/n2ws. That's snark.cloud/n2ws.



I'm not going to name the vendor because I've reached out to them, and had a long conversation with their CSO, and identified a whole bunch of failures here, and it's not constructive to name them, so I'm not going to. But the story is instructive. For one of our business service vendors, Mike called up support, as you’re wont to do. The answer then was, “Oh, okay. What's your name?” Mike gave his name. It's Mike Julian, for those following along at home. “Great. What's your email address?” was the second question, which, again, for those following along at home, is not that hard to wind up figuring out if you have ever emailed with someone at this company. Spoiler: it might be “Mike at,” similar to the fact that I'm “Corey at,” and the third question was—just kidding. There was no third question. “There you go. I have disabled the MFA portion of your account. Have a nice day.”



There was also no email that went out to other admins, owners, etcetera, of this account. So, it was purely taking someone at their word when they called in, and then going ahead and disabling MFA for the asking. You begin to see the problem here, because the vendor in question has a bunch of sensitive data that we kind of have to give them, given the nature of what they do, and it's not something that anyone really has much of a choice in once you pick a vendor. So, it became a difficult situation.



So, I lashed out on Twitter, as I tend to do, without naming names; then I calmed down a little bit. Pro tip: don't make decisions when you're angry. You almost never make the right one. And then I had a conversation with their CSO because it turns out that when you're loud and obnoxious and have your own podcast, you could get in front of all kinds of folks.



And I learned a couple of things. First, there was another form of authentication. Namely, they were able to figure out that the phone number that Mike was calling in from was in fact associated with the account, and the way that they wind up doing phone number identification means it's not impossible, but not trivial, to spoof it in that respect. Frankly, the answer then became that someone in their support center did not follow the process and procedure that they were supposed to. I validated that this was, in fact, a teachable moment and not an “I'm going to get someone fired by mistake” moment, and they had the right answers, and I genuinely believe that they were sincere and correct in this.



But the lesson that we take away from this goes well beyond MFA, and goes well beyond security, and goes into a human problem, for which there is, as of yet, no patch. And that is simply that no matter what you do for policies and procedures, and what you build, and how your security flow works, if someone can choose to disregard it and just bypass all of the authentication and the checks, you need to have a plan for that. Expect people to inherently do the wrong thing. Make it hard for people to do the wrong thing; make it easy for them to do the right thing; and follow up on these things. Credit where due, they were able to pull the recording in question and validate exactly what happened and why.



Note also that when an angry customer calls up with, “I'm locked out of my account. And I hate having to call in,” there's a human-nature piece where you want to make that person happy. Because, sure, people might go on the internet and tell lies, but call you up on the phone? Who would do such a thing? That's not reasonable. That's not something that anyone would do here in my society. So, understand that there are a lot of challenges here that need to be worked through. There are a lot of process flows. And training and drilling people on procedures is important. And no, firing people who get it wrong is never a valid answer, because that doesn't go super-well. It winds up incentivizing people to hide mistakes. Don't do that. This has been sort of a letting down of my originally promised zero-day disclosure on this. This is why, incidentally, it's best not to make tweets or podcasts while angry.



This has been the AWS Morning Brief: Whiteboard Confessional. I am Cloud Economist Corey Quinn fixing AWS bills here or wherever I happen to find them. And if you've enjoyed this podcast, please leave a five-star review on Apple Podcasts. Whereas if you've hated this podcast, please leave a five-star review on Apple Podcasts anyway, and tell me what I should be using for an MFA provider instead.



Thank you for joining us on Whiteboard Confessional. If you have terrifying ideas, please reach out to me on Twitter at @quinnypig and let me know what I should talk about next time.


Announcer: This has been a HumblePod production. Stay humble.

Fri, 19 Jun 2020 03:00:00 -0700
AWS Graviton2 Clock Speeds Broadly Non-Competitive
AWS Morning Brief for the week of June 15, 2020
Mon, 15 Jun 2020 03:00:00 -0700
Whiteboard Confessional: On Getting Fired

About Corey Quinn

Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.


Links


Transcript

Corey: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semi-polite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best name to call your staging environment is “theory”, because invariably whatever you’ve built works in theory, but not in production. Let’s get to it.



This episode is sponsored by a personal favorite: Retool. Retool allows you to build fully functional tools for your business in hours, not days or weeks. No front end frameworks to figure out or access controls to manage; just ship the tools that will move your business forward fast. Okay, let's talk about what this really is. It's Visual Basic for interfaces. Say I needed a tool to, I don't know, assemble a whole bunch of links into a weekly sarcastic newsletter that I send to everyone. I can drag various components onto a canvas: buttons, checkboxes, tables, etc. Then I can wire all of those things up to queries with all kinds of different parameters: post, get, put, delete, etc. It all connects to virtually every database natively, or you can do what I did and build a whole crap ton of Lambda functions, shove them behind API Gateway, and use that instead. It speaks MySQL, Postgres, Dynamo—not Route 53, in a notable oversight; but nothing's perfect. Any given component then lets me tell it which query to run when I invoke it. Then it lets me wire up all of those disparate APIs into sensible interfaces. And I don't know frontend; that's the most important part here: Retool is transformational for those of us who aren't frontend types. It unlocks a capability I didn't have until I found this product. I honestly haven't been this enthusiastic about a tool for a long time. Sure, they're sponsoring this, but I'm also a customer and a super happy one at that. Learn more and try it for free at retool.com/lastweekinaws. That's retool.com/lastweekinaws, and tell them Corey sent you because they are about to be hearing way more from me.



Today's episode of the AWS Morning Brief: Whiteboard Confessional was supposed to be about a zero-day that I was disclosing. Cooler heads have prevailed and we will talk about that next week instead, once I've finished some conversations with the company in question. Sorry to disappoint you all, but I have something you might enjoy instead.



So, today I want to talk about getting fired, which is one of my personal specialties. I'm not kidding when I tell people that a primary driver of starting my own consultancy was to build a company wherein I could not be ejected on the spot by surprise. Since I can't be fired anymore, let's talk about the mechanics of getting fired from someone who's been through it, just so folks get a better perspective on this. In the United States, our worker protections are basically non-existent compared to most civilized countries. Barring a contract or collective bargaining agreement to the contrary, you can be fired in the United States for any reason or no reason, except based upon membership in a protected class.



So, to be clear, my personality is certainly justification enough to fire me. I say this for our listeners in other countries who hear I was fired and equate that to a moral failing. “What’d you do, rob the cash register?” No, I'm just me; I'm difficult to work with; I'm expensive to manage, and my personality is exactly what you would expect it to be based upon this podcast. The way the firing usually works is that you get an unexpected meeting request with your boss. “Hey, can we chat?”



Those meetings are so unnerving that even that intro leaves scars years later. My business partner and I—both of us clearly can't be fired—still get nervous when we tell each other, “Hey, we need to talk in an hour.” We have instituted an actual policy against this at our company, just due to the collective trauma that so many of us have gone through with those, “Is this how I get fired?” moments. So, you have an unplanned meeting with your boss. Nine times out of 10—or more like 99 times out of 100—it’s fine; it’s no big deal; it’s about something banal.



But in this meeting, you walk in and, surprise, there's someone from human resources there too, and they don't offer you coffee. First, I want to say the idea of calling people “resources” is crappy—HR, people ops, whatever you want to call it—but regardless, they're there; they're certainly not smiling, and they don't offer you coffee.



And that's the tell. When you're invited to a meeting that you weren't expecting and no one gives you coffee, it is not going to be a happy meeting. They usually have a folder sitting there on the table in front of them that has a whole bunch of paperwork in it. There's the “this is the NDA that you signed when you started your job here; it's still enforceable; we're reminding you of it” paperwork. There's a last paycheck, and a separate paycheck for your cashed-out vacation time in jurisdictions where that gets paid out, like California. And often, there's another contract there. This is called a severance agreement. The company is going to pay you some fixed amount of money in return for absolving them of any civil claims that you may have had during the course of your employment. I'm not your attorney, but let me tell you what the right answer here is.



Whatever you do, do not sign that contract in that room, in that moment. You've just been blindsided; you don't have a job anymore; you're most definitely not at your best. And you're certainly going to be in no position to carefully read a nuanced legal document prepared by your employer’s attorney designed to constrain your future behavior. They may say, “Take all the time you want,” or they may imply they can't give you your last paycheck until you sign it. The Department of Labor would like a word with them if that's the case because that's not legal.



Thank them, leave with your head held high and bask for a moment in the freeing sense of no longer having any obligation to your now ex-employer. All the projects you had in flight, let them go. All the things you needed to tackle; the office politics: not your problem anymore. You're free. Now, in the next day or two, have an employment attorney read through that agreement and give their advice.



Usually, there's a payment of some varying amount of money, and in return for that payment, you agree to a few things. You'll release them from civil liability for any claims you may have, and in many cases, you're going to be asked to agree to a non-disparagement clause. That means that you agree not to say anything critical or disparaging about your previous employer. There's also a separate NDA as a part of these things that prevents you from disclosing the existence of that agreement. This, incidentally, is why I have nothing disparaging or critical to say about a number of my previous employers. In addition to crapping on your old job being a generally terrible look, it might lead to having to return a small to mid-sized check. Now, let's get to the fun part of the story—by which I mean the technical bits—after this.



This episode is sponsored in part by N2WS. You know what you care about? Many things, but never backups. At least until right after you really, really, really needed to care about backups. That's what N2WS does for your AWS account. It allows you to cycle backups through different storage tiers; you can back things up cost-effectively, and safely. For a limited time, N2WS is offering you $100 in AWS credits for setting up their free trial, and I encourage you to give it a shot. To learn more, visit snark.cloud/n2ws. That's snark.cloud/n2ws.



So, I have a weird thing that I've noticed when getting fired from companies. My job was invariably always either running the ops team, or on the ops team and, in a couple places, being the entire ops team. That meant that I had full access to basically everything: the full production environment, the AWS account, the secret store, the shared password manager, the domain registration account that the founder used, all of it. So, for companies that aren't practiced at exercising a rigorous termination policy and procedure—which in my case was most of them because I worked in small business—an awful lot of things have to change, and quickly.



You can't fire someone and then walk around for a couple days with them still having access to your systems. That is terrible policy; not everyone's going to be level-headed; and you never know when someone's going to do something ill-advised. Now, as you might imagine, when you're working inside of a small company, and there's an infinite amount of work to do, building out a policy by which you quickly and efficiently lock someone out—who has full access to pretty much everything—isn't the first thing you focus on, until one day suddenly it absolutely is because someone just got fired. Now, before I was the greatest cloud influencer in the world, remember, if you will, that I started my career as a grumpy Unix sysadmin. That means that despite all of my modern sensibilities here in 2020, I still had a personal server or two hanging out somewhere that was completely disconnected from everything corporate. That's where I kept all of my stuff that I cared about: my side projects, etc. That server in various forms has saved my bacon more than once because one of the things I always taught it to do, out of habit, when I started a new ops job was to keep an eye on my employer’s public sites.



There are certain classes of systems or network failure where everything—even the monitoring systems themselves—breaks. I always viewed that personal system—with a cron job that would ping things with Nagios or whatnot—as sort of a watchtower of last resort. I never put anything confidential onto it, for obvious reasons. It was strictly looking at public-facing endpoints. There were no credentials. It wasn't logging in. It wasn't a great check. But, “Hey, is the site up?” is more or less what it was looking at.
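


A watchtower like that can be almost embarrassingly simple. Something like this Python sketch, run from cron, covers the “is the site even up” case; the URL and the notification mechanism are placeholders:

    import urllib.request

    SITES = ["https://www.example.com/"]  # public endpoints only; no credentials


    def is_up(url: str) -> bool:
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.status < 500
        except Exception:
            return False


    for site in SITES:
        if not is_up(site):
            # Swap in email, SMS, or a pager integration here.
            print(f"Hey, your website's down, genius: {site}")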



There have been two occasions in my career during which that system caught outages that other things didn't. Because frankly, monitoring is freakin’ terrible, and sometimes what it takes is a third party that's not connected to anything. Multiple times during my career, either while being fired in that meeting or shortly after that meeting, that system sent me its, “Hey, your website's down, genius,” message. Now, I don't have inside info on what happened next because, again, I was fired. But what I strongly suspect happened was that companies were doing the right thing by revoking all of my credentials and all of the shared credentials, but when they did it, they did so unevenly, and would, for example, reset a database password before teaching the application about the new one, and then the whole site goes down.



I do want to point out for a minute here that I have a functioning sense of ethics. I don't actually know for a fact that my previous employers ever turned off my accounts, or that this is what caused the next day or two of sporadic outages of their public-facing websites in some cases. I'm theorizing wildly here. I mean, what was I going to do, log in with those now dead credentials—or try to log in and check? Yeah, that is an ethical breach. Don't do that.



So, the trick and the takeaway here is that you should have a plan for what it looks like when someone leaves the company, before you need to figure it out in a hurry and probably take the website down. And credit where due: there's a villain in every story, and in this case, it's me. I admit that the lack of a policy and procedure in place to safely rotate credentials when firing someone does, as an ops person, fall squarely on my shoulders. I failed in that aspect of my job, so I guess it's a good thing that I was fired.
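


As a sketch of what “having a plan” can mean in concrete AWS terms—this is illustrative, not a complete offboarding runbook—disabling a departed user's own access is only a couple of IAM calls:

    import boto3

    iam = boto3.client("iam")


    def disable_user(username: str) -> None:
        """One step of an offboarding runbook: kill API keys and console access."""
        # Deactivate (rather than delete) access keys, in case something breaks.
        for key in iam.list_access_keys(UserName=username)["AccessKeyMetadata"]:
            iam.update_access_key(
                UserName=username, AccessKeyId=key["AccessKeyId"], Status="Inactive"
            )
        # Remove the console password, if the user had one.
        try:
            iam.delete_login_profile(UserName=username)
        except iam.exceptions.NoSuchEntityException:
            pass

The shared credentials—database passwords, third-party API keys, and the like—still need rotating, and that's the part that takes websites down when it's done unevenly.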



This has been the AWS Morning Brief: Whiteboard Confessional. I am Cloud Economist Corey Quinn. And if you've enjoyed this podcast, please leave a five-star review on Apple Podcasts. Whereas if you've hated it, please leave a five-star review on Apple Podcasts and tell me another reason I should have been fired.



Thank you for joining us on Whiteboard Confessional. If you have terrifying ideas, please reach out to me on Twitter at @quinnypig and let me know what I should talk about next time.


Announcer: This has been a HumblePod production. Stay humble.

Fri, 12 Jun 2020 03:00:00 -0700
Enduring the Cloud Migration Factory
AWS Morning Brief for the week of June 8, 2020.
Mon, 08 Jun 2020 03:00:00 -0700
Whiteboard Confessional: The Time I Almost Built My Own Email Marketing Service

About Corey Quinn

Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.


Links

Transcript
Corey: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semi-polite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best name to call your staging environment is “theory”, because invariably whatever you’ve built works in theory, but not in production. Let’s get to it.



nOps will help you reduce AWS costs 15 to 50 percent if you do what it tells you. But some people do. For example, watch their webcast, how Uber reduced AWS costs 15 percent in 30 days; that is six figures in 30 days. Rather than a thing you might do, this is something that they actually did. Take a look at it. It's designed for DevOps teams. nOps helps quickly discover the root causes of cost and correlate that with infrastructure changes. Try it free for 30 days, go to nops.io/snark. That's N-O-P-S dot I-O, slash snark.



Welcome once again to the AWS Morning Brief: Whiteboard Confessional. Today I want to talk about, once again, an aspect of writing my Last Week in AWS newsletter. This goes back to before I was sending it out twice a week instead of just once, and my needs weren't that complex. I would gather a bunch of links throughout the week and make fun of them, and I had already built this absolutely ridiculous system that would render all of my sarcasm from the ridiculous database where it lived down into generated HTML. And I've talked about that system previously; I'm sure I will again. That's not really the point of the story.



Instead, what I want to talk about is what happened after I had that nicely generated HTML. Now, I've gone through several iterations of how I sent out my newsletter. The first version was through a service known as Drip, that's D-R-I-P. And they were great because they were aimed at effectively non-technical folks, by and large, where it’s, “Oh, you want to use a newsletter. Go ahead.”



I looked at a few different vendors. MailChimp is the one that a lot of folks go with for things like this. At the time I was doing that selection, they were having a serious spam problem. People were able to somehow bypass their MFA. Basically, their reputation was in the toilet and given my weird position on email spam, namely, I don't like it, I figured this is probably not the best time to build something on top of that platform, so that was out.



Drip was interesting, in that they offered a lot of useful things, and they provided something far more than I needed at the time. They would give me ways to say, “Okay, when someone clicks on this link, I can put them in different groups,” etcetera, etcetera. You know, the typical crappy email-tracking thing that squicks people out. Similar to the idea of, “Hey, I noticed you left something in your cart. Do you want to go back and buy it?” Those emails that everyone finds vaguely disquieting? Yeah, that sort of thing. So, 90 percent of what they were doing I didn't need, but it worked well enough, and I got out the door and used them for a while.



Then they got acquired, and it seemed like they got progressively worse month after month, as far as not responding to user needs, doing a hellacious redesign that was retina-searingly bad, being incredibly condescending toward customer complaints, subtweeting my co-founder on a podcast, and other delightful things. So, the time came to leave Drip. So, what do I do next? Well, my answer was to go to SendGrid. And SendGrid was pretty good at these things. They are terrific at email deliverability—in other words, getting email from when I hit send on my system into your inbox, bypassing the spam folder, because presumably you've opted in to receive this and confirmed that you have opted in—so that wasn't going to be a problem.



And they still are top-of-class for that problem, but I needed something more than that. I didn't want to maintain my own database of who was on the newsletter or not. I didn't want to have to handle all the moving parts of this. So, fortunately, they wound up coming out with a tool called Marketing Campaigns, which is more or less designed for kind of newsletter-ish style things if you squint at it long enough. And I went down that path and it was, to be very honest with you, abysmal.



It was pretty clear that SendGrid was envisioning two different user personas. You had the full-on developer who was going to be talking to them via API for the sending of transactional email, and that's great. Then you had marketing campaign folks who were going to be sending out these newsletter equivalents or mass broadcast campaigns, and there was no API to speak of for these things. It was very poorly done. I'm smack dab between these two personas, where I want to be able to invoke these things via API, but I also want to be able to edit them in a web interface, and I don't want to have to handle all the moving parts myself. So, I sat down and had a long conversation with Mike, my business partner. And well, before I get into what we did, let's stop here.



This episode is sponsored in part by N2WS. You know what you care about? Many things, but never backups. At least until right after you really, really, really needed to care about backups. That's what N2WS does for your AWS account. It allows you to cycle backups through different storage tiers; you can back things up cost-effectively, and safely. For a limited time, N2WS is offering you $100 in AWS credits for setting up their free trial, and I encourage you to give it a shot. To learn more, visit snark.cloud/n2ws. That's snark.cloud/n2ws.



So, we looked at ConvertKit, which does a lot of things right, but gets a few things wrong. There's still no broadcast API; you have to click things to make it go. So, I have an email background, and Mike has an engineering background. And we sat there and decided we were going to build something ourselves to solve this problem. And we started drawing on a whiteboard in my office. This thing is four feet by six feet; it is bolted to the wall by a professional because I have the good sense to not tear my own wall down, instead hiring someone else to do it for me.



And we spent a couple of hours filling this whiteboard with all of the features we'd want this to have and how we would implement them. It would require multiple databases, because we wanted to have an event system: when someone clicked the link to opt themselves out, we wanted to make sure they got opted out of that thing. But did we want them to opt out of everything, like the t-shirt announcements that we run? Now that we have multiple newsletters, we want people to be able to opt in for the Wednesday issue but not the Monday one, and vice versa. So, we realized at the end of a few hours of work that, A) we had completely filled this enormous whiteboard, and 2) we had effectively built ConvertKit from first principles and fixed a couple of weird bugs along the way, and that was really it.



At which point we looked at each other realized we're being patently ridiculous, threw the whole thing away and became ConvertKit customers, which we remain to this day. There are still a couple of annoying edge case issues that drive us nuts, but by and large, it was the right decision. And what I want to talk about today is that this is a common pattern. And very frequently, the resolution of the story is different, where, “No. We’re going to go ahead and build this thing ourselves.”



Remember, we have a consultancy that fixes AWS bills. And the marketing, such as it is for that, is this podcast, the other podcast, and the newsletter. We're viewed in some ways as a media company, but we're really not. I view what I do, in terms of this sort of thing, as being the marketing function for the consultancy: it gets our name out there. You'll notice that there's no direct call to action here of, “That's why you should call me to fix your AWS bills,” though you should. It's about name recognition so that when someone has a problem one day, we pop to mind when that problem becomes an AWS bill. Our job is not to build an email system from first principles. We are not a product company. Building a tool like that internally should only be done if it adds differentiated value that we can't get anywhere else.



Now sure, if you’re Google or Amazon, you will go ahead and build an email marketing service like that. It makes sense: if you're Amazon, you're going to build Pinpoint and completely biff it for this use case, because it tries to be too many things at once to a market they don't fully understand. But if you aren't either at that scale, or if you're not directly aligned with that as your core business function, then find something that gets you most of the way there, buy it, and move on. Otherwise, I would never have time to write the newsletter, because I'm too busy fixing the system that makes it send to your inboxes. That's an awful place to find myself. I don't want to build SendGrid plus one; I don't want to build a competitor to ConvertKit, because those things are not what I want to spend my time on. I don't find that I can add anything differentiated in that space. And it is such a colossal distraction and waste of effort from the thing that actually does add value, namely me being sarcastic and crappy towards various cloud services here and other places every week.



So, what I'm saying here is, when you're about to build something, make sure that it aligns with the overall strategy your business is undertaking. Remember, every line of code you write is something you're going to have to maintain yourself. Now, I just write a check to ConvertKit, and when things there break, as they tend to do occasionally because, surprise, that's what computers do, then if I'm being aggressive, I open a ticket; if I'm not, I just wait and the problem goes away on its own, because that is their core competency; that is what they're focusing on. I don't care who you use for an email service provider. Truly, I don't; I don't have a horse in that race.



If I were starting this newsletter over from scratch, I would almost certainly go with something like Revue, or curated.co, or Constant Contact, or Campaign Monitor maybe, or VerticalResponse, or a whole bunch of other different companies in this space that solve for this. Don't go ahead and build your own unless you somehow have a take on email newsletters that is going to transform an entire industry and you're going to become a company where that is what you do as a SaaS product. Otherwise, don't do that. You're going to distract yourself and lose focus. And honestly, forget money; forget customers; forget attention. Focus is the number one thing that companies seem to run out of the most.



This has been the Whiteboard Confessional. I am Cloud Economist Corey Quinn. And if you've enjoyed this podcast, please leave a five-star review on Apple Podcasts. Otherwise, please leave a five-star review on Apple Podcasts anyway, and tell me why my choice in email provider is crap.



Thank you for joining us on Whiteboard Confessional. If you have terrifying ideas, please reach out to me on Twitter at @quinnypig and let me know what I should talk about next time.



Announcer: This has been a HumblePod production. Stay humble.


Fri, 05 Jun 2020 03:00:00 -0700
AWS Security Landscapers
AWS Morning Brief for the week of June 1, 2020
Mon, 01 Jun 2020 03:00:00 -0700
Whiteboard Confessional: The Core Problem in Cloud Economics

About Corey Quinn

Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Links



Transcript


Corey: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semi-polite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best name to call your staging environment is “theory”, because invariably whatever you’ve built works in theory, but not in production. Let’s get to it.



nOps will help you reduce AWS costs 15 to 50 percent if you do what it tells you. But some people do. For example, watch their webcast, how Uber reduced AWS costs 15 percent in 30 days; that is six figures in 30 days. Rather than a thing you might do, this is something that they actually did. Take a look at it. It's designed for DevOps teams. nOps helps quickly discover the root causes of cost and correlate that with infrastructure changes. Try it free for 30 days, go to nops.io/snark. That's N-O-P-S dot I-O, slash snark.



Corey: Welcome to the AWS Morning Brief: Whiteboard Confessional. Today we're going to tell a story that only happened a couple of weeks ago. We don't usually get to tell stories about what we do in the AWS bill fixing realm because companies are understandably relatively reticent to talk about this stuff in public. They believe, rightly or wrongly, that it will annoy Amazon which, frankly, is one of my core competencies. They think that it shows improper attention to detail to their investors and others.



I don't see it that way, but I found a story that we can actually talk about today in a bit more depth and detail than we normally would. So, we get a phone call about three weeks ago. Someone has a low five-figure bill every month on AWS. That's generally not large enough for us to devote a consulting project to, because frankly the ROI would take far too long. It's too small of a bill to make an engagement with us cost-effective, but we had a few minutes to kill and thought, “Eh, go ahead and pull up your bill. We'll take a quick look at it here.” Half of their bill historically—and growing—had been data transfer which, okay, that's interesting. And as we've known from previous episodes, data transfer is always strange. So, they looked at this and they said, “Okay, well, obviously then instead of serving things directly, we should get a CDN.”



So, then instead of getting a CDN, they chose to set up CloudFront, which is basically a CDN, only worse in every way. And they saw no impact to their bill after a month of this. Okay, let's change that up a bit. Now, instead of CloudFront, we're going to move to an actual CDN. So, they did; there was a small impact to their bill. Okay, and costs continued to rise, so what's going on? At this point in the story, they call us, which, generally, if you're seeing something strange on your bill, is not a terrible direction to go in. We see a lot of these things, and if we can't help you, we will point you towards someone who can. So, our consensus on this was: great, it is too small to look at the bill.



But let's pop into Cost Explorer and see what's going on. We break it down by service, and S3 was the biggest driver of spend. Now, that's interesting. Number two was EC2. But okay, we start with the big numbers and work our way down. This is apparently novel for folks doing in-depth bill analysis, but we're going to go with it anyway. We start taking a look within that S3 category at usage type, and lo and behold, data transfer out to the internet is driving almost all of it. The cost per request is super low. That tells us in turn—because we've seen a lot of these—that there are large objects, but relatively few requests for them.



So, all right, we're going to slice and dice slightly differently within Cost Explorer, AWS’s free—with an asterisk next to it—tool for exploring various aspects of your bill. That asterisk, incidentally, means that if you're doing this via API, it is one cent per call. If you're doing this in the console, it's free. Be aware, that can catch you by surprise if you write a lot of very chatty scripts. You have been warned. So, yeah, most of the spend was indeed on GetObject calls. So, okay, we know that data transfer spend was coming from an S3 bucket that was not going to a CDN. Otherwise, it's going to show up as a different data transfer charge in a different section of their bill.
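For the terminally curious, here's a rough, hypothetical sketch of that same slicing done via the API with boto3. The dates and filter values are placeholders, and remember: every one of these calls costs you a penny.

import boto3

# Break one month's S3 spend out by usage type, which is where
# DataTransfer-Out-Bytes makes itself known.
ce = boto3.client("ce")
response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2020-04-01", "End": "2020-05-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    Filter={
        "Dimensions": {
            "Key": "SERVICE",
            "Values": ["Amazon Simple Storage Service"],
        }
    },
    GroupBy=[{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
)

for result in response["ResultsByTime"]:
    for group in result["Groups"]:
        print(group["Keys"][0], group["Metrics"]["UnblendedCost"]["Amount"])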



Okay, so now we know it's S3. We have to figure out what bucket it lives within. And this is an obnoxious process, and we tell them this. And they’re like, “Oh, yeah, we know what bucket that is.” This, incidentally, is where almost every software tool tends to fall down. We could spend some time tracking this down programmatically. Or we can just ask someone who already has the context of what their business does and how it works loaded into their head. Because otherwise, what we'd have to do is tag all their buckets, and then wait a few days for that tag to percolate into the billing system, and then query it again because the visibility into this sort of thing is terrible. It's a shortcoming of both Cost Explorer and S3 in that weird seam between the two. There's fundamentally no easy way to see at a glance which buckets are costing you money unless you do something fun with tagging.
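For the record, “something fun with tagging” can look roughly like this. It's a minimal sketch, assuming a tag key of bucket-name, and note that PutBucketTagging replaces any existing tags wholesale:

import boto3

s3 = boto3.client("s3")

# Tag every bucket with its own name so Cost Explorer can slice S3 spend
# per bucket once the key is activated as a cost allocation tag.
for bucket in s3.list_buckets()["Buckets"]:
    name = bucket["Name"]
    # Careful: this overwrites existing bucket tags. Merge with the
    # output of get_bucket_tagging first if you already tag things.
    s3.put_bucket_tagging(
        Bucket=name,
        Tagging={"TagSet": [{"Key": "bucket-name", "Value": name}]},
    )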



To that end, I'm a big believer in having every bucket tagged with its own bucket name, and then enabling that tag as a cost allocation tag so I can start slicing on it. So, great. Now, what is it that's getting requested? Well, once you have the bucket, you can also dig into this via access logs to see what's going on. Great. Now, we take a look at this, and sure enough—well, before I go into what it actually was, let's pause here for a moment.



This episode is sponsored in part by N2WS. You know what you care about? Many things, but never backups. At least until right after you really, really, really needed to care about backups. That's what N2WS does for your AWS account. It allows you to cycle backups through different storage tiers; you can back things up cost-effectively, and safely. For a limited time, N2WS is offering you $100 in AWS credits for setting up their free trial, and I encourage you to give it a shot. To learn more visit snark.cloud/n2ws. That's snark.cloud/n2ws.



Corey: So, this bucket was set to public access. Aha! It only allowed GetObject, so you couldn't ListObjects. And you couldn't upload objects into it, so my famous “get people's attention by copying $4 million worth of data into their bucket” approach of finding the problem wouldn't work here. So, great. The solution to this, offhand, was to restrict access to that bucket to just the CDN, so that you couldn't access it directly from the larger internet, but only from the CDN that was accessing it. Because it was using signed requests from the CDN, this was pretty easy to do.
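As a flavor of what that restriction can look like, here's a hypothetical sketch assuming a CDN that publishes a fixed egress IP range. The bucket name and CIDR are made up:

import json
import boto3

# Deny direct GetObject from anywhere that isn't the CDN's address space.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyGetExceptFromCDN",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:GetObject",
        "Resource": "arn:aws:s3:::example-media-bucket/*",
        "Condition": {"NotIpAddress": {"aws:SourceIp": ["192.0.2.0/24"]}},
    }],
}

boto3.client("s3").put_bucket_policy(
    Bucket="example-media-bucket", Policy=json.dumps(policy)
)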



Now, there are a few things to point out here. One—and this is going to take some people aback, and cause them to gasp—this company in question worked in the adult entertainment space. So, we now have a rough idea of what tends to be in these buckets: generally, large media files. And with this type of large media file, something we've learned since the dawn of the internet is that regardless of your bandwidth, your constraints, etcetera, if you make this stuff available to folks for free, and public, the usage and consumption of it will grow to fill whatever your constraint is. Historically, that was a network constraint. But in the world of cloud, instead, it's a budgetary constraint.



So, the solution here was to shut down the quote-unquote, “backdoor” that let people query the bucket directly, and only allow access through the CDN. Now, this all came from a half-hour of exploration on the phone with someone. And the reason we were able to do it that quickly is because we see this stuff, kind of, a lot. Maybe not quite this egregiously, but it's definitely a pattern that we know and recognize. The first time you see it, you've got to spend a few days doing a deep dive into it. Then it feels like, oh, it's like that one other time we saw this. The second time, you look like a wizard from the future. Mastery is never as far away from these things as we'd like to pretend it is.



Now, the solution that we had was pretty decent, because it basically removed almost all of this. Now, the only S3 egress charge from that bucket is to their CDN, for legitimate paying customers accessing it via a locked-down, signed-request model. It solved their problem. There is, however, a better answer to this. And unfortunately, it's not one that we are able to implement at all.



And that better solution is for AWS themselves to maybe notice that half of a freakin’ bill is S3 data transfer. And maybe when that happens in the five-figure-a-month range, it's worth possibly flagging this for review and checking in with the customer: “Hey, are you aware that this is the case?” Or analyzing what's going on in the account. Maybe—and I'm just spitballing here—having the account manager reach out to figure out, A) what's going on? And B) is this normal and expected? And then, if you want to add a bonus C) on top of that: “By the way, I am your account manager, and if you need anything, please don't hesitate to reach out. What have you got planned? How can we help you as your cloud provider?” It really leads to a better outcome for everyone, rather than having stories like this show up.



I'm not saying this to bag on S3, I'm not saying this to bag on data transfer pricing—much. And I'm not saying this to bag on AWS as a whole. But there needs to be a better, more holistic, and far more systemic way of analyzing what's going on in various customer accounts. When things fall into certain profiles, just a quick check-in with that customer can go an awfully long way. Because we found this, and the company in question is thrilled with us. They're not so thrilled with AWS. If AWS had proactively pointed this out, it would have been a better experience for everyone.



This is the core problem that we see in cloud economics. You can charge customers for an awful lot of things, but make sure that A) they know what they're being charged for, and B) it doesn't surprise them. When they discover, “Aha, I found the misconfiguration that was driving 50% of my bill,” that doesn't feel good for anyone. And it doesn't lead to people continuing to invest in whatever cloud provider they're on. This has been my low-key rant about billing. This is the AWS Morning Brief: Whiteboard Confessional. I am Cloud Economist Corey Quinn, fixing AWS bills here in San Francisco or, more directly, on the internet because that's where they all live. And if you've enjoyed this podcast, please leave a five-star review on Apple Podcasts. Whereas if you’ve hated it, please leave a five-star review on Apple Podcasts and a copy of your latest AWS bill.



Thank you for joining us on Whiteboard Confessional. If you have terrifying ideas, please reach out to me on twitter at @quinnypig and let me know what I should talk about next time.



Announcer: This has been a HumblePod production. Stay humble.


Fri, 29 May 2020 03:00:00 -0700
Introducing AWS SnowCannon
AWS Morning Brief for the week of May 25, 2020.
Mon, 25 May 2020 03:00:00 -0700
Whiteboard Confessional: Naming Is Hard, Don’t Make it Worse

About Corey Quinn

Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Links



Transcript


Corey: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semi-polite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best thing to call your staging environment is “theory”. Because invariably, whatever you’ve built works in theory, but not in production. Let’s get to it.



nOps
will help you reduce AWS costs 15 to 50 percent if you do what it tells you. And some people do. For example, watch their webcast, “How Uber Reduced AWS Costs 15 Percent in 30 Days”; that is six figures in 30 days. Rather than a thing you might do, this is something that they actually did. Take a look at it. It's designed for DevOps teams. nOps helps you quickly discover the root causes of cost and correlate them with infrastructure changes. Try it free for 30 days: go to nops.io/snark. That's N-O-P-S dot I-O, slash snark.



Good morning AWS, and welcome to the AWS Morning Brief: Whiteboard Confessional. Today we're going to revisit DNS. Now, now, slow down there, Hasty Pudding. Don't bother turning the podcast off. For once, I'm not talking about using it as a database… this time. As you're probably aware, DNS is what folks use to equate friendly names like twitterforpets.com, or incredibly unfriendly names like Oracle.com, to IP addresses, which is how computers tend to see the world. I'm not going to rehash what DNS does.



Instead, I'm going to talk about a particular kind of DNS problem that befell a place I used to consult for. They're publicly traded now, so I'm not going to name them. An awful lot of shops do something that's called split-horizon DNS. What that means is that if you're on a particular network, a DNS name resolves differently than it does when you're on a different network. For example, admin.twitterforpets.com will resolve to an administrative dashboard if you're on the Twitter For Pets internal network via VPN, but it won't resolve to that dashboard if you're outside the network, or it might resolve nowhere, or it might resolve just back to their main website, www.twitterforpets.com.
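In Route 53 terms, the internal half of that split is typically a private hosted zone. Here's a minimal sketch of creating one with boto3; the zone name and VPC ID are placeholders:

import time
import boto3

route53 = boto3.client("route53")

# Records in this zone resolve only from within the associated VPC;
# the public internet never sees them.
route53.create_hosted_zone(
    Name="internal.twitterforpets.com",
    VPC={"VPCRegion": "us-east-1", "VPCId": "vpc-0123456789abcdef0"},
    CallerReference=str(time.time()),  # must be unique per request
    HostedZoneConfig={
        "Comment": "Internal-only view for split-horizon DNS",
        "PrivateZone": True,
    },
)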



And that's fine. Most DNS providers can support this, and Route 53 is, of course, no exception. This is, incidentally, what the Route 53 resolver, which was released in 2018, is designed to do: it bridges private DNS zones to on-premises environments, so your internal zones can resolve to private IP addresses without you having to publish your private IP address ranges in public zones for everyone to see. The reason that matters is that it keeps you from broadcasting your architecture or your network layout outside your company. Some folks consider doing that to be a security problem because it discloses information that an attacker can then leverage to gain further toeholds into your network. Some folks also think that that tends to be a little bit on the extreme side. I'll let you decide because I don't care, and that's not what the story is about.



The point is that split-horizon DNS is controversial for a few reasons, but in many shops it is considered the right thing to do because it's what they've been doing. The internal DNS names either don't resolve to anything publicly, or they resolve to a different system that's configured to reject the request outright. But there is another path you can take; a third option that no one discusses, a path far darker because it is oh, so very much dumber. But first…



This episode is sponsored in part by N2WS. Do you know what you care about? Many things, but never backups. At least until right after you really, really, really needed to care about backups. That's what N2WS does for your AWS account. It allows you to cycle backups through different storage tiers; you can back things up cost-effectively, and safely. For a limited time, N2WS is offering you $100 in AWS credits for setting up their free trial, and I encourage you to give it a shot. To learn more visit snark.cloud/n2ws. That's snark.cloud/n2ws.



What I'm about to describe is far too stupid for my made-up startup of Twitter For Pets, so we're going to have to invent a somehow even dumber company, and we're going to call it Uber For Squirrels. It's like regular Uber, except it somehow manages to lose less money. Now, there was a very strong argument within the engineering community inside of Uber For Squirrels, and the proclamation they decided on was this: split-horizon DNS is dangerous, because a misconfiguration could leak records into the wrong places and theoretically take the entire Uber For Squirrels site down. There are merits to those arguments and you can't dismiss them out of hand, so a bargain was struck.



The external DNS zone was therefore decreed to be uberforsquirrels.com, while the internal zone was configured to be uberforsquirrels.net. The uberforsquirrels.net zone was only accessible inside of the network. From the outside, nobody could query it. Now, this is, in isolation—before I go further—a bad plan all on its own. When you're reading quickly, uberforsquirrels.com and uberforsquirrels.net don't jump out visually to people as being meaningfully different. You're going to typo it in config files constantly without meaning to, and then you're going to have a hell of a time tracking it down because it's not immediately obvious that you're talking to the wrong thing; you might think it's a network problem. Your tab completion out of your known_hosts file is going to break; if you have such a thing configured in your environment, you're going to have to hit tab a couple of extra times to cycle through the dot-net and dot-com variants. It's just a general irritant.



But that's not enough to justify an episode of the show. Because wait, that is still some Twitter For Pets level brokenness. Why do I need to throw Uber For Squirrels under the bus? Well, because it turns out that despite using uberforsquirrels.net everywhere as their internal domain, they didn't actually own uberforsquirrels.net. It wasn't entirely clear who did other than that the registration was in another country, so it probably wasn't something that the CEO registered and then forgot about in his random domain list of things he acquired for companies he was going to start one day. And that zone itself was wildcarded to a single IP address. And what that means is that no matter what you typed in, admin.uberforsquirrels.net, vpn.uberforsquirrels.net, payments.uberforsquirrels.net, it all landed on the same place on a single server.



And that server had some interesting things configured on it: HTTP, HTTPS, SSH, and many other listeners were hanging out on that server, which just sat there listening on basically every port for every protocol. It would silently wait for connections and then let you send it whatever it is you wanted. So, if you weren't on the VPN when you thought you were, boom, you just typed your credentials into some rando's web server. They even had a wildcarded email server set up. Anything emailed to any username at all at uberforsquirrels.net would go through, and it was never clear what happened to it afterwards.



“Damn it, I hit the dot net, again!” was the rallying cry in the Uber For Squirrels engineering halls. And then, it was time for yet another credential rotation. Now, maybe this person who set this up had no idea what chaos they caused. Maybe they did it intentionally. Maybe they were a disgruntled former employee; I don't know. What I do know is that one day, the domain was transferred to the company by way of an escrow service. So, I can only assume that that person was in turn given an eye-wateringly large check. Good for them. I mean, that is the type of mistake that was easily avoided, if only someone had been paying attention. By the time people realize the trouble that they were in, it was too late because changing all of your systems to use a different DNS zone entirely is non-trivial.



Now, I try not to fill this podcast with stories of things that broke once in a weird way that couldn't possibly ever recur. There should, ideally, be a moral to every episode, something you can take with you. And the idea is that there's a takeaway here, something that you can do to make your own environment better. So, here you go; this episode is no exception to that general trend. It is imperative that you own all of the domains you use, regardless of whether they're internal or external. And that includes domains under top-level domains that don't officially exist. For a long time, there was a finite list of publicly resolvable top-level domains, so folks would take liberties with the rest. Internal domains would be set to companyname.corp, development domains would be companyname.dev, and production domains would be dot prod.



And then, the chuckle-fucks at ICANN—that’s I-C-A-N-N—the group that regulates all of these things, decided that they like money a lot. And they put up a system by which anyone could get their own top-level domain if they make a good enough argument for it and cut an $80,000 check to ICANN. For example, dot aws is now a domain. You go to amazon.aws and that will resolve.



Chime.aws is a domain, but they refuse to give me lastweekin.aws. It's sad, and if you're listening and can help with that, please reach out. This entire problem was then complicated by the chuckle-fucks at Google, who, to be fair, did one thing right: they bought the dot prod domain and are sitting on it so it doesn't resolve, so people aren't going to be sending company secrets all over the place. But they also bought dot dev, and they opened up dot dev to anyone who wanted to register any domain, which means that if your company uses yourcompanyname.dev as an internal testing domain, understand that if someone registers that domain, they can set up the exact same listening problem I've just described. Don't make the same mistake. Check your internal domains, check your testing domains, and make sure you own them. Then point them to something that you control, so you don't have to wonder who just sent company secrets to the wrong place. Domains are not expensive; data breaches very much are.



This has been another episode of the AWS Morning Brief: Whiteboard Confessional. I am Cloud Economist Corey Quinn, and if you've enjoyed this episode, please do me a personal favor and leave a five-star review on Apple Podcasts. Whereas if you've hated this, please leave a five-star review on Apple Podcasts, and then send your complaint to lastweekinaws.net.



Thank you for joining us on Whiteboard Confessional. If you have terrifying ideas, please reach out to me on twitter at @quinnypig and let me know what I should talk about next time.



Announcer: This has been a HumblePod production. Stay humble.


Fri, 22 May 2020 03:00:00 -0700
Amazon Macie Some Well Deserved Pushback

AWS Morning Brief for the week of May 18, 2020.

Mon, 18 May 2020 03:00:00 -0700
Whiteboard Confessional: You Down with UTC? Yeah, You Know Me

About Corey Quinn

Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Links



Transcript


Corey: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semi-polite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best thing to call your staging environment is “theory”. Because invariably, whatever you’ve built works in theory, but not in production. Let’s get to it.



nOps will help you reduce AWS costs 15 to 50% if you do what it tells you. And some people do: for example, watch their webcast, “How Uber Reduced AWS Costs 15% in 30 Days.” That is six figures in 30 days. Rather than a thing you might do, this is something that they actually did. Take a look at it. It's designed for DevOps teams. nOps helps you quickly discover the root causes of costs and correlate them with infrastructure changes. Try it free for 30 days. Go to nops.io/snark. That's nops.io/snark.



Today I want to talk about a funny thing: time. Time has taken on a different meaning for many of us during the current pandemic. Hours seem like days. Days seem like months. But in the context of computers, time is a steady thing. Except when it's not. Things like leap years, leap seconds, Google's famous leap smear and, of course, our ever-changing friends, time zones, combine and collude with one another to make time a very hard problem when it comes to computers. In the general case, computers think of time in terms of seconds since the start of the Unix epoch on January 1, 1970. This is incidentally—and not the point of this episode—going to cause a heck of a lot of excitement when 32-bit counters roll over in 2038. But that's a future problem, similar to Y2K, that I'm sure won't bother anyone.
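If you want to see that future problem for yourself, here's the 2038 rollover in a few lines of Python:

from datetime import datetime, timezone

# A signed 32-bit counter of seconds since the Unix epoch tops out at
# 2**31 - 1; one second later, it wraps.
print(datetime.fromtimestamp(2**31 - 1, tz=timezone.utc))
# 2038-01-19 03:14:07+00:00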



Time leads to suboptimal architectural choices, which is bad, and then those choices become guidance which is in turn far, far worse. Now, AWS has said a lot of things over the years that I despise and take massive issue with. Some petty and venial, like pronunciation, but none of them were quite so horrifying as a tweet. On May 17, 2018, the official AWS Cloud Twitter account tweeted out an article with the following caption, “Change the timezone of your Amazon RDS instance to local time.” I hit the roof immediately and began ranting about it and railing against that tweet in particular.



I believe this is the first time that me yelling at AWS in public hit semi-viral status. My comment, to be precise, was, “Absolutely do not do this. UTC is the proper server timezone unless you want an incredibly complex problem after you scale. Fixing this specific problem has bought consultants entire houses in San Francisco.” Now, I stand by that criticism and I maintain that your databases should be in UTC at all times, as should the rest of your servers. And I'll explain why, but first:



This episode is sponsored in part by N2WS. You know what you care about? Many things, but never backups. At least, until right after you really, really, really needed to care about backups. That's what N2WS does for your AWS account. It allows you to cycle backups through different storage tiers so you can back things up cost effectively and safely. For a limited time N2WS is offering you a hundred dollars in AWS credits for setting up their free trial. And I encourage you to give it a shot. To learn more, visit snark.cloud/n2ws. That's snark.cloud/n2ws.



It's important that all of your systems be in the same timezone: UTC. Universal Time Coordinated doesn't change with the seasons. It doesn't observe daylight saving time. It's the closest thing we've got to a unified central time that everyone can agree on. Now, you're going to take issue with a lot of that, and I'm not suggesting that you should display that time to your users. You have a lot of options for altering the display of time at the presentation level. You can detect the timezone that their browser is set to. You can let them select their time zone in the settings of your application. You can do what ConvertKit—one of my vendors—does, and force everything to display in US East Coast time for some godforsaken reason. But all of those options are far better than setting the server time to local time.
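Here's the pattern in miniature, as a sketch assuming Python 3.9 or later for zoneinfo: store and compute in UTC, and convert only when a human is looking.

from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# What goes into the database: always UTC.
event_time = datetime.now(timezone.utc)

# What the user sees: converted at the presentation layer, using a
# timezone detected from their browser or chosen in their settings.
user_tz = ZoneInfo("America/New_York")
print(event_time.astimezone(user_tz))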



Over the years, I've been told during job interviews, when I ask what kinds of problems a company is currently wrestling with, that this shameful secret exists within their walls. And it's a big deal, because changing one system requires changing every system that ties back to it. Google apparently had all of their servers originally set to Pacific Coast time, or headquarters time, and this caused them problems for over a decade. I can't confirm that because I haven't ever worked there, so I wouldn't know other than stories people tell while sobbing into beers. But it stands to reason, because once you've gone down this path, it is incredibly difficult to fix it.



What's not so obvious is why exactly this is so painful. And the problem comes down to change. Time zones change. Daylight saving time alters when it takes place in the given location from year to year. And time zones themselves don't hold still either, as geopolitical things tend to change.



Remember that computers don't just use time to tell you what time something is right now. They look at when log entries were made, what happened in a particular time frame? What was the order of those specific events that all came from different systems? When was a change actually implemented? And you really, really don't want to have to apply complex math to logs just to reconstruct historical events in time. “Well, that one was before daylight saving time took effect that year in that particular location where the server was running in, so just carry the two.” That becomes awful stuff, and no one wants to have to go through that. It also leads to scenarios where you can introduce errors with bad timezone math.



Now, there are a couple of solid objections here, but one of the only ones that I saw advocated on Twitter when I started ranting about it was of the very reasonable form, “Look, most stuff that uses databases in a lot of companies is for a single location at a single company, and it's never going to need to scale beyond that company.” And I don't take issue with that argument. In fact, I would argue that it's correct. I think that one of the problems we have in technology is that we tend to assume everything we build has to be scalable and ready to go across the globe at a moment's notice. And it simply doesn't happen the vast majority of the time, but it's still a terrible direction to go in for something like this because there's going to be an exception that suddenly does require that eventually, and having to unwind all of the time decisions that you've made in an opinionated way in your system is incredibly painful and difficult. Whereas just starting from the beginning, and understanding that all of your systems are going to be operating in UTC makes this a heck of a lot easier. Nobody builds most things expecting to have to scale them massively, but sometimes you get lucky, and surprise.



Fixing these things after the fact takes tears, and sadness, and fire, and brimstone, and articles on the internet, and people thinking you're dumb when you talk to them about your current architecture in job interviews, and it just doesn't lead anywhere good. So, if you have a choice, and you're debating this in any way, please set your system to Universal Time. My developer instance, which is only ever used by me, still uses Universal Time. Now, I have something set up in my Unix user profile that sets my preferred timezone for my user, and that's what it displays on the command line and elsewhere: when I run ls or look at a bunch of my files, that's the timezone their edit times are shown in. But that's all done, again, at the presentation layer. The system time itself is set to UTC.



Now, to their credit, AWS has since edited the how-to guide that their tweet linked to, and it starts off by saying that changing an RDS system to local time is strongly not recommended. And I do accept and admit that there are times where you're going to need to change RDS to a local timezone just to ensure compatibility with previous poor decisions that you're already suffering from. But it shouldn't be done as a best practice, and it certainly shouldn't be advertised as a good idea. And when you do have to implement something like this, you should feel more than a little dirty about it, because it's not a best practice; it is absolutely something that can cause you pain down the road. If you can possibly get away from it, you should.



This has been the AWS Morning Brief: Whiteboard Confessional. I am Cloud Economist Corey Quinn. If you've enjoyed this podcast, please leave a five-star review in Apple Podcasts. Whereas if you've hated it, please leave a five-star review on Apple Podcasts anyway, along with a comment telling me what time it is.



Thank you for joining us on Whiteboard Confessional. If you have terrifying ideas, please reach out to me on twitter at @quinnypig and let me know what I should talk about next time.



Announcer: This has been a HumblePod production. Stay humble.


Fri, 15 May 2020 03:00:00 -0700
The AWS Machine That Goes PING
AWS Morning Brief for the week of May 11, 2020.
Mon, 11 May 2020 03:00:00 -0700
Whiteboard Confessional: Click Here to Break Production

About Corey Quinn

Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Links


Transcript


Corey: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semi-polite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best thing to call your staging environment is “theory”. Because invariably, whatever you’ve built works in theory, but not in production. Let’s get to it.



On this show, I talk an awful lot about architectural patterns that are horrifying. Let’s instead talk for a moment about something that isn’t horrifying. CHAOSSEARCH. Architecturally, they do things right. They provide a log analytics solution that separates out your storage from your compute. The data lives inside of your S3 buckets, and you can access it using APIs you’ve come to know and tolerate, through a series of containers that live next to that S3 storage. Rather than replicating massive clusters that you have to care for and feed yourself, you now get to focus on just storing data, treating it like you normally would other S3 data: not replicating it, not storing it on expensive disks in triplicate, and fundamentally not having to deal with the pains of running other log analytics infrastructure. Check them out today at CHAOSSEARCH.io.



Today on the AWS Morning Brief: Whiteboard Confessional, I'm telling a different story than I normally do. Specifically, this is the tale of an outage from several weeks ago. The person who shared this story with me has requested to remain anonymous and further wishes me to not mention their company at all. This is, incidentally, a common occurrence. Folks don't generally want to jeopardize their relationship with AWS by disclosing a service issue they see, whereas I don't have that particular self-preservation instinct. Then again, I'm not a big AWS customer myself. I'm not contractually bound to AWS in any meaningful way, and I'm not an AWS partner, nor am I an AWS Hero. So, all that AWS really has over me in terms of leverage is the empty threat of taking away my birthday. So, let's dive into this anonymous story. It's a good one.



A company was minding its own business, and then had a severity one incident. For those who aren't familiar with that particular designation, you can think of it as the company's primary service falling over in an embarrassingly public way. Customers noticed, and everyone ran around screaming a whole lot. Now, if we skip past the delightful hair-on-fire diagnosis work, the behavior that was eventually tracked down was that an SNS topic had a critical listener get unsubscribed. That SNS topic invoked said listener, which in turn drove a critical webhook call via API Gateway. This is a bad thing, obviously.



Fundamentally, customers stopped receiving webhooks that they were expecting, and this caused a nuclear meltdown given the nature of what the company does, which I can't disclose and isn't particularly relevant anyway. But, for those who are not up to date on the latest AWS terminology, service names, and parlance, what this means at a high level is that a thing happens inside of AWS, and whenever that thing happens, it's supposed to fire off an event that notifies this company's paying customers. This broke because something somewhere unsubscribed the firing off dingus from the notification system. Now that we're aware of what caused the issue at a very high level, time to dig into how it happened and what to do about it. But first:



In the late 19th and early 20th centuries, democracy flourished around the world. This was good for most folks, but terrible for the log analytics industry because there was now a severe shortage of princesses to kidnap for ransom to pay for their ridiculous implementations. It doesn’t have to be that way. Consider CHAOSSEARCH. The data lives in your S3 buckets in your AWS accounts, and we know what that costs. You don’t have to deal with running massive piles of infrastructure to be able to query that log data with APIs you’ve come to know and tolerate, and they’re just good people to work with. Reach out to CHAOSSEARCH.io. And my thanks to them for sponsoring this incredibly depressing podcast.



The logs for who unsubscribed it are, of course, empty, which is a problem for this company’s blameless-in-theory-but-blame-you-all-the-way-out-of-the-company-if-it-turns-out-that-it-was-you-that-clicked-this-thing-and-didn't-tell-anyone, philosophy. CloudTrail doesn't log this event because why would it? CloudTrail’s primary purpose is to rack up bills and take the long way around before showing events in your account, not to assist with actual problem diagnosis, by all accounts. Now, fortunately, this customer did have AWS Enterprise Support. It exists for precisely this kind of problem. It granted them access to the SNS team which had considerably more insight into what the heck had happened, at which point the answer became depressingly clear, as well as clearly depressing.



It turns out that the unsubscribe URL at the bottom of every SNS notification wasn't authenticated. Therefore, anyone who had access to the link could have invoked it, and that's what happened when a support person did something very reasonable: copy and paste a log message containing that unsubscribe link into a team Slack channel. It wasn't their fault [unintelligible] because they didn't click it. The entity triggering this was—and I swear I'm not making this up—Slackbot.



Have you ever noticed that when you paste a URL into Slack, it auto-expands the link to show you a preview? It tries to do that on every URL, and you can't disable URL expansion at the Slack workspace level. You can blacklist URLs, but only if the link expansion succeeds; in this case, there's no preview, so it doesn't succeed, so there's nothing for it to blacklist. Slack’s helpful feature can't be disabled on a team-wide level, so when that unsubscribe URL showed up in a log snippet that got pasted, it silently unsubscribed the consumer from SNS and broke the entire system.



Now, there are an awful lot of things that could have been different here. Isn't this the sort of thing that might be better off with SQS, you might reasonably ask? Well, four years ago, when this system was built, SQS could not, and did not, support invoking Lambda functions, so SNS was the only real option. That's incidentally why it is so very important to have decent surface coverage when you launch an AWS service. Otherwise, stuff like this makes it into production and doesn't get revisited. And think about it for a minute: why would it? If you have a thing that's working, why would it ever occur to you to go back and have something like SQS or Kinesis invoke this system instead, now that that's supported? SNS is there. It's working. It's been working for four years. Until suddenly, today, it wasn't.



People don't go back and adjust things without a compelling reason. It never makes the priority list until something like this happens. It's a common pattern. You can also get around this by mandating that an unsubscribe request in SNS be authenticated rather than supporting anonymous unsubscribe. But that's arcane knowledge the first time. It's commonplace only the second time you encounter this.



For example, if you've listened this far, you now know it can be done. The parameter name, because you're going to need this, is AuthenticateOnUnsubscribe. And I'm sorry that you have to know that now. But again, this was working as it was for four years until something else external happened that impacted this.
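For the curious, here's roughly where that parameter goes, sketched with boto3. The topic ARN and token are placeholders:

import boto3

sns = boto3.client("sns")

# Confirming the subscription with AuthenticateOnUnsubscribe set means an
# unsubscribe now requires a signed, authenticated request, so a stray
# link expansion can no longer silently remove the consumer.
sns.confirm_subscription(
    TopicArn="arn:aws:sns:us-east-1:123456789012:critical-webhooks",
    Token="<token from the SubscriptionConfirmation message>",
    AuthenticateOnUnsubscribe="true",  # the string "true", not a boolean
)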



There's a lot of things that could have been done differently, and I'm not casting blame on anyone. But I will say—looking at this from my perspective—now that I've told this story, look at your own environment and think for a minute. What links get tossed into Slack teams at various times? Would any of them cause problems if an unauthenticated user were to click on them? Because, fundamentally, that's what's happening under the hood, courtesy of Slackbot. Maybe consider whether or not that's a good pattern, and do something about it before it ends in disaster that touches your customers. And maybe consider that the next time, before you go all-in on the current ChatOps hype nonsense.



This has been the AWS Morning Brief: Whiteboard Confessional. I'm Cloud Economist Corey Quinn. If you've enjoyed this podcast, please leave a five-star review on Apple Podcasts. If you didn't enjoy this podcast, please leave a five-star review on Apple Podcasts anyway, and it's okay because you are probably Slackbot or work closely with Slackbot.



Thank you for joining us on Whiteboard Confessional. If you have terrifying ideas, please reach out to me on twitter at @quinnypig and let me know what I should talk about next time.



Announcer: This has been a HumblePod production. Stay humble.


Fri, 08 May 2020 03:00:00 -0700
AWS Non-Profit Organisations
AWS Morning Brief for the week of May 4, 2020.
Mon, 04 May 2020 03:00:00 -0700
Whiteboard Confessional: Hacking Email Newsletter Analytics & Breaking Links

About Corey Quinn

Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Links

Transcript

Corey: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semi-polite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best thing to call your staging environment is “theory”. Because invariably, whatever you’ve built works in theory, but not in production. Let’s get to it.



On this show, I talk an awful lot about architectural patterns that are horrifying. Let’s instead talk for a moment about something that isn’t horrifying. CHAOSSEARCH. Architecturally, they do things right. They provide a log analytics solution that separates out your storage from your compute. The data lives inside of your S3 buckets, and you can access it using APIs you’ve come to know and tolerate, through a series of containers that live next to that S3 storage. Rather than replicating massive clusters that you have to care for and feed yourself, you now get to focus on just storing data, treating it like you normally would other S3 data: not replicating it, not storing it on expensive disks in triplicate, and fundamentally not having to deal with the pains of running other log analytics infrastructure. Check them out today at CHAOSSEARCH.io.



On Monday, I sent out a newsletter issue to over 18,000 people where the links didn't work for the first hour and a half. Then they magically started working. Today on the AWS Morning Brief: Whiteboard Confessional, I'm not talking about a particular design pattern, but rather conducting a bit of a post-mortem on what exactly broke and why it suddenly started working again an hour and a half later. To send out the Last Week in AWS newsletter, I use a third-party service called ConvertKit that, in turn, wraps itself around SendGrid for actual email delivery. They, in turn, handle an awful lot of the annoying, difficult parts of newsletter management. As a quick example, unsubscribes. If you unsubscribe from my newsletter, which you should never do, I won't email you again. That's because they handle the subscription and unsubscription process.



Now, as another example, when you sign up for the newsletter, you get an email series that tailors itself to a “choose your own platypus” adventure based upon what you select. True story. Their logic engine powers that, too. ConvertKit is awesome for these things, but they do some things that are also kind of crappy. For example, they do a lot of link tracking that is valuable, but it's the creepy kind of link tracking that I don't care about and really don't want. Also, unfortunately, their API isn't really an API so much as it is an attempt at an API that an intern built, because they thought it was something you might enjoy.



I can't create issues via API. I have to generate the HTML and then copy and paste it in like a farm animal. And their statistics and metrics APIs won't tell me the kinds of things I actually care about, but their website will, so they have the data; it just requires an awful lot of clicking and poking. And when I say things I don't care about, let me be specific. Do you know what I don't care about? Whether you personally, dear listener, click on a particular link. I do not care; I don't want to know. That's creepy; it's invasive, and it isn't relevant to you or me in any particular way.



But I do care what all of you click on in aggregate. That informs what I include in the newsletter in the future. For example, I don't care at all about IoT, but you folks sure do. So, I'm including more IoT content as a direct response to what you folks care about. Remember, I also have sponsors in the newsletters, who themselves include links and want to know how many people clicked on those things. So, clicks also need to be unique: I care if a user clicks on a link once, but if they click on it two or three times, I don't want that to increment the counter further, so there are a bunch of edge-case issues here.



Here are the questions that I need to answer that ConvertKit doesn't let me get at extraordinarily well. First, what were the five most popular links in last week's issue? I also care what the top 10 most popular links over the last three months were. That helps me put together the “Best of” issues I'm going to start shipping out in the near future. I also care what links got no clicks, because people just don't care about them or I didn't do a good job of telling the story. It helps me improve the newsletter.



With respect to sponsors, I care how each individual sponsor performs relative to other sponsors. If one sponsor link gets way fewer clicks, that's useful to me. Since I write a lot of the sponsor copy myself, did I get something wrong? On the other hand, if a sponsored link gets way more clicks than normal, what was different there? I explicitly fight back against clickbait, so outrage generators, like racial slurs injected into the link text, are not permitted. Therefore, when a sponsored link outperforms what I would normally expect, it means that they're telling a story that resonates with the audience, and that is super valuable data. Now, I'll tell you what I built, and what went wrong. After this.



In the late 19th and early 20th centuries, democracy flourished around the world. This was good for most folks, but terrible for the log analytics industry because there was now a severe shortage of princesses to kidnap for ransom to pay for their ridiculous implementations. It doesn’t have to be that way. Consider CHAOSSEARCH. The data lives in your S3 buckets in your AWS accounts, and we know what that costs. You don’t have to deal with running massive piles of infrastructure to be able to query that log data with APIs you’ve come to know and tolerate, and they’re just good people to work with. Reach out to CHAOSSEARCH.io. And my thanks to them for sponsoring this incredibly depressing podcast.



I built a URL redirector to handle all of these problems plus one more. Namely, I want to be able to have an issue that has gone out with a link in it, but I want to be able to repoint that link after I've already hit send. Why do I care about that? Well, if it turns out that a site is compromised, or throws up a paywall, and makes for a crappy reader experience, I want to be able to redirect that link to somewhere that is less obnoxious for people. It's the kind of problem you don't think about until one day it bites you when a site that you’ve linked against is hosting a hostile ad network and then gets your entire newsletter blacklisted by a whole bunch of spam gateways. You never make that mistake twice.



So, I set out to write some code to handle this very specific problem and its use case. So, I built all of this on top of serverless technologies. Because I'm a zealot, and too cheap to run an EC2 instance, it's a perfect setup, too, for the serverless model. The links get an awful lot of clicks on Monday morning and taper off throughout the day, then there's a long tail ghost town of clicks for the rest of the week. There are some, but not a lot. So, running something full-bore that entire time doesn't make a whole lot of sense. I used the Serverless Framework because that maps to how serverless technologies work in my mental model. It fits my canonical understanding of the serverless state of the world. There's a single function behind an API gateway. I tried to use one of the new HTTP APIs that they announced a few weeks ago, but the Serverless Framework support for that isn't great yet. I wanted to set up a custom domain for this thing, and the serverless plugin for domain management doesn't support that functionality, plus, all of the example docs for Serverless Framework don't leverage the new functionality.



So, my choices were either to write a whole bunch of custom code and figure this stuff out myself as I guinea-pigged my way through it, or to use the historical, long-standing API Gateway instead. Now, API Gateway is overly complex, it's more expensive, and it's super confusing, but I have a bunch of working examples of how to make it go. So, that was awesome. I've already suffered through those slings and arrows once; I now know the path. And this monstrosity went through several iterations. First, it was a stateless service that worked via Lambda@Edge. The problem with stateless services for things like this is that it would automatically shorten any link that you throw against it. It turns out that if you don't have a list of approved links that it will return redirects for, spammers can and will misuse your link shortener to send spam anywhere, so I had to rebuild it with that viewpoint in mind.



So, it's an API Gateway backed by a single Python Lambda function in us-east-1 that's less than 100 lines. And when I say less than 100 lines, I'm talking around 60 or so. What it does is receive an HTTP event from the API Gateway. There's a unique per-user ID that's automatically injected by ConvertKit, which I don't ever query for anything, stored as a query parameter in that request. There's also a path parameter that winds up decoding into a dict containing the URL and an issue number, because a lot of links are going to occur in different issues, and I want to make sure that I get the aggregate numbers correct between issues. It then takes all that data and makes a single call to DynamoDB that does what is basically magic, outstripping what Route 53, my normal preferred database, is capable of in this use case.



It validates that the URL in question exists in the database for the issue number that's given. If it does, it checks to see whether the per-user ID that was provided has clicked on the link already. If they have, it just returns the URL. If the user hasn't clicked that link yet, it increments the click counter and adds the user ID to an array in the database that's never actually queried; it just winds up acting as a “yep.” So, it becomes an atomic click, and you clicking five times only shows up as one for me. Lastly, if the link isn't in the database, it instead returns a redirect to lastweekinAWS.com. All of that is done in a single UpdateItem DynamoDB call that Alex DeBrie helped me with. His DynamoDB book is absolutely fantastic. Go buy it if terrible stories like this are up your alley and you want to avoid doing things like this yourself.
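As a sketch of what a single conditional UpdateItem like that can look like: the table and attribute names here are invented, and the real version is Alex DeBrie's handiwork, certainly cleaner than this.

import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("newsletter-links")

def record_click(issue, url, user_id):
    try:
        table.update_item(
            Key={"issue": issue, "url": url},
            # Atomically bump the counter and remember who clicked.
            UpdateExpression="ADD clicks :one, clickers :uid_set",
            # ...but only if the link exists for this issue and this
            # user hasn't already been counted.
            ConditionExpression=(
                "attribute_exists(#u) AND NOT contains(clickers, :uid)"
            ),
            ExpressionAttributeNames={"#u": "url"},
            ExpressionAttributeValues={
                ":one": 1,
                ":uid_set": {user_id},  # serializes as a string set
                ":uid": user_id,
            },
        )
    except ClientError as err:
        # Condition failed: either an unknown link or a repeat click.
        # Either way, the counter doesn't move.
        if err.response["Error"]["Code"] != "ConditionalCheckFailedException":
            raise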



So, all of this works. Everything I've described works. What broke? Well, for starters, it wasn't supposed to go out yet, but when I pushed a small change to fix another bug, this got merged in by mistake. What I was doing was adding a /v1 to the API endpoint for this system, before I rolled it out next week. Because you always want to be able to version your APIs, and assorted nonsense. If I were to eventually redo this in something sensible that a grown-up might use, I can keep the same URL and just add a v2, then route it differently with API gateway, and now I haven't broken any of the old links, because it's obnoxious when you have an old email sitting in your old archive box that you want to click on, and it throws an error. I try to maintain these things in perpetuity.



Now, because all of this wasn't even set to be deployed, it hadn't been hooked into my test harness, which would have flagged all the breaking links when I went to publish, because I've sent out enough broken links over the years that I want to keep those to a minimum. But this completely slipped through, and gave a false positive across all of the tests, because I am bad at programming. Now, there are still some strange lingering questions as to why this broke in the various ways that it did, and how it got out in the first place, that I'm still tracking down. But now you at least understand why it broke, how I was able to fix it without sending another newsletter—though I did send one, to tell people that it was fixed, and to promise this episode—and, best of all, how to tell a story around something you wrote that broke in an embarrassingly public way and turn it into a podcast episode. Thanks for listening to the Whiteboard Confessional on the AWS Morning Brief.



Thank you for joining us on Whiteboard Confessional. If you have terrifying ideas, please reach out to me on twitter at @quinnypig and let me know what I should talk about next time.



Announcer: This has been a HumblePod production. Stay humble.


Fri, 01 May 2020 03:00:00 -0700
Cape Town Region Is Expensive AF
AWS Morning Brief for the week of April 27, 2020.
Mon, 27 Apr 2020 03:00:00 -0700
Whiteboard Confessional: Don’t Run a Database on Top of NFS

About Corey Quinn

Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.



Links


Transcript

Corey: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semi-polite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best thing to call your staging environment is “theory”. Because invariably, whatever you’ve built works in theory, but not in production. Let’s get to it.



Corey: On this show, I talk an awful lot about architectural patterns that are horrifying. Let’s instead talk for a moment about something that isn’t horrifying. CHAOSSEARCH. Architecturally, they do things right. They provide a log analytics solution that separates out your storage from your compute. The data lives inside of your S3 buckets, and you can access it using APIs you’ve come to know and tolerate, through a series of containers that live next to that S3 storage. Rather than replicating massive clusters that you have to care for and feed yourself, you now get to focus on just storing data, treating it like you normally would other S3 data: not replicating it, not storing it on expensive disks in triplicate, and fundamentally not having to deal with the pains of running other log analytics infrastructure. Check them out today at CHAOSSEARCH.io.



I talk a lot about databases on this show. There are a bunch of reasons for that, but they mostly all distill down to the fact that databases are—and please don't quote me on this, as I'm not a DBA—where the data lives. If I blow up a web server, it can have hilarious consequences for a few minutes, but it's extremely unlikely to do too much damage to the business. That's the nature of stateless things. They're easily replaced, and it's why the infrastructure world has focused so much on the recurring mantra of cattle, not pets.



But I digress. This episode is not about mantras. It's about databases. Today's episode of the AWS Morning Brief: Whiteboard Confessional returns to the database world with a story that's now safely far enough in the past that I can talk about it without risking a lawsuit. We were running a fairly standard three-tiered web app. For those who haven't had the pleasure because their brains are being eaten by the microservices worms, these three tiers are web servers, application servers, and database servers. It's a model that my father used to deploy, and his father before him.



But I digress. This story isn't about my family tree. It's about databases. We were trying to scale, which is itself a challenge, and scale is very much its own world. It's the cause of an awful lot of truly terrifying things. You can build an application that does a lot for you on your own laptop. But now try scaling that application to 200 million people. Every single point of your application architecture becomes a bottleneck long before you get anywhere near that scale, and you're going to have oodles of fun re-architecting it as you go. Twitter very publicly went through something remarkably similar about a decade or so ago; the fail whale was their error page when Twitter had issues, and everyone was very well acquainted with it. It spawned early memes and whatnot. Today, they've solved those problems almost entirely.



But I digress. This episode isn't about scale, and it's not about Twitter. It's about databases. So my boss walks in as we're trying to figure out how to scale a MySQL server for one reason or another, and casually suggests that we run the database on top of NFS.



[Record Scratch]



Yes, I said NFS. That's the Network File System. Or, if you've never had the pleasure, the protocol that underlies AWS's EFS offering, or Elastic File System. Fun trivia story there: I got myself into trouble, back when EFS first launched, with Wayne Duso, AWS's GM of EFS, among other things, by saying that EFS was awful. At launch, EFS did have some rough edges, but in the intervening time, they've been fixed to the point where my only remaining significant gripe about EFS is that it's NFS. Today, I mostly view NFS as something to be avoided for greenfield designs, but you've got to be able to support it for legacy things that are expecting it to be there. There is, by the way, a notable exception: using EFS with Fargate for persistent storage.



But I digress. This episode isn't about Fargate. It's about databases.



Corey: In the late 19th and early 20th centuries, democracy flourished around the world. This was good for most folks, but terrible for the log analytics industry because there was now a severe shortage of princesses to kidnap for ransom to pay for their ridiculous implementations. It doesn’t have to be that way. Consider CHAOSSEARCH. The data lives in your S3 buckets in your AWS accounts, and we know what that costs. You don’t have to deal with running massive piles of infrastructure to be able to query that log data with APIs you’ve come to know and tolerate, and they’re just good people to work with. Reach out to CHAOSSEARCH.io. And my thanks to them for sponsoring this incredibly depressing podcast.



So I'm standing there, jaw agape at my boss. This wasn't one of those many mediocre managers I've had in the past that I've referenced here; he was, and remains, the best boss I've ever had. Empathy and great people management skills aside, he was also technically brilliant. He didn't suggest patently ridiculous things all that often, so it was sad to watch his cognitive abilities declining before our eyes. “Now, hang on,” he said, “before you think that I've completely lost it. We did something exactly like this at my old job. It can be done safely and sanely, and it offers great performance benefits.” So, I'm going to skip what happened next in this story, because I was very early in my career, and I hadn't yet figured out that it's better not to actively insult your boss in a team meeting based only upon a half-baked understanding of what they've just proposed. To his credit, he took it in stride, and then explained how to pull off something that sounds, on its face, truly monstrous.



Now, I've doubtless forgotten most of the technical nuance here, preferring instead to use much better databases like Route 53. But the secret that made this entire monstrosity work was that we didn't just use crappy servers with an open-source file server daemon running on top of them as our NFS server. Oh no, we decided to solve this problem properly, by which I mean we used NetApp Filers. Now, I want to pause here to make a few points. First, and most importantly, NetApp is not a sponsor of this podcast in any way; I'm not here to shill for them. In fact, there's a laundry list of reasons not to use NetApps, not the least of which is that they are, and remain, ungodly expensive. Second, over a decade on from the time this story takes place, there are way better ways to get the IOPS you need than shoving the MySQL data volume onto something that's being accessed from the database server via NFS. All of those ways are better than what we did.



Thirdly, in the era of cloud we live in today, which we assuredly did not over a decade ago, you can't get a NetApp Filer shoved into us-east-1 without a whole lot of bribery and skullduggery. So, this model won't even work in a cloud environment in the first place. And fourthly, if you were still trying to do this in the cloud, you would absolutely need to control the network between the Filer and the database servers, and in a cloud environment, you're not going to be able to do that. So, this entire story is entirely off the table today. But, if you're back in 2009, and you're trying to solve this problem with the exact constraints I've laid out, there are worse approaches you could take.



But I digress. This episode isn't about time travel. It's about databases. This setup led to other problems, too, once we got this thing up and running. Backups were painful, for example. While NetApp Snapshots were now the right way to back up the data store, NetApp Snapshots, of course, being awesome, we had to run a script on the database instances that were talking to those volumes in order to quiesce the database first. The reason here is that databases are, to be very clear, terrible. To go into slightly more detail, you want all of your in-flight transactions to be written to disk, so the snapshot of your database volume isn't captured in an inconsistent state. If you capture a database volume in the middle of a write, there's a great chance that that backup won't be usable, or that data corruption will sneak in.
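
For the curious, here's a minimal sketch of what that quiesce-then-snapshot script could look like in modern terms, assuming the pymysql client library; the snapshot trigger is a hypothetical stand-in for however you'd actually tell the Filer to take its snapshot.

```python
# A sketch of the quiesce-then-snapshot dance described above, using the
# pymysql client library. trigger_netapp_snapshot() is a hypothetical
# stand-in for however you'd actually kick off the Filer snapshot.
import pymysql


def trigger_netapp_snapshot() -> None:
    # Hypothetical placeholder: call the Filer's API (or ssh to it) here.
    print("snapshot requested")


def snapshot_with_quiesce(host: str, user: str, password: str) -> None:
    conn = pymysql.connect(host=host, user=user, password=password)
    try:
        with conn.cursor() as cur:
            # Flush in-flight writes to disk and block new ones so the
            # snapshot captures a consistent on-disk state.
            cur.execute("FLUSH TABLES WITH READ LOCK")
            trigger_netapp_snapshot()
            # Everything stays blocked until this runs; that's the
            # one-to-two-second pause you can see on the graphs.
            cur.execute("UNLOCK TABLES")
    finally:
        conn.close()
```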



This wasn't difficult, and because we figured it out before it ever bit us, it was merely annoying. During each snapshot, we effectively had to block every write to that database, and reads too, as it turned out, because of the way we structured things (see a previous episode of this show for more on that), then wait for the in-flight writes to finish, the snapshot to complete and report that it completed, and the database to unlock again. That meant, realistically, a one- to two-second pause for everything using that database during each snapshot window, and you could see it on the graphs as plain as day. This wasn't a huge deal, but it definitely was annoying for some of the high-performance workloads involved.



We also, of course, had to stripe our database volumes across a whole bunch of spinning disks living in a NetApp shelf, because this was an era in which large-scale enterprise SSDs weren't really a thing. And as this continued to grow, you wouldn't buy just one more disk; you had to buy an entire shelf. So, there was definitely a step-function shape to what this was going to cost at scale. It wasn't the nice linear ramp you would see with a cloud provider. And our NetApp reseller, as a direct result, renamed their corporate jet after us, because, oh my stars, was this entire thing expensive to pull off, start to finish. A modern cloud architecture is better than what I've just described in virtually every way, unless you're a NetApp reseller. So, now you know at least a little bit more about the root of my aversion to NFS. It has less to do with the protocol's shortcomings and more to do with, as with oh so very many other things, my tendency to have seen it used as a database.
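
To illustrate that step function, here's a toy comparison of shelf-based pricing against a linear per-terabyte ramp; every number is invented purely for the example.

```python
# Toy comparison of the two cost curves: buying capacity one NetApp shelf
# at a time (a step function) versus a linear per-terabyte cloud ramp.
# Every number here is invented purely for illustration.
import math

SHELF_CAPACITY_TB = 12
SHELF_COST = 50_000        # hypothetical dollars per shelf
CLOUD_COST_PER_TB = 1_200  # hypothetical dollars per TB


def shelf_cost(needed_tb: float) -> int:
    # You can't buy a fraction of a shelf, hence the ceiling.
    return math.ceil(needed_tb / SHELF_CAPACITY_TB) * SHELF_COST


for tb in (1, 11, 13, 25):
    print(f"{tb:>3} TB: shelves=${shelf_cost(tb):,}  cloud=${tb * CLOUD_COST_PER_TB:,}")
```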



This has been the AWS Morning Brief: Whiteboard Confessional. I'm Cloud Economist Corey Quinn, and I'll talk to you next week.



Thank you for joining us on Whiteboard Confessional. If you have terrifying ideas, please reach out to me on Twitter at @quinnypig and let me know what I should talk about next time.


Announcer: This has been a HumblePod production. Stay humble.

Fri, 24 Apr 2020 03:00:00 -0700
AWS Billing System Go BRRRRRR
AWS Morning Brief for the week of April 20, 2020.
Mon, 20 Apr 2020 03:00:00 -0700
Whiteboard Confessional: The 15-Person Startup with 700 Microservices: A Cautionary Tale

About Corey Quinn

Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Links


Transcript


Corey: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semi-polite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best thing to call your staging environment is “theory,” because invariably whatever you’ve built works in theory, but not in production. Let’s get to it.



On this show, I talk an awful lot about architectural patterns that are horrifying. Let’s instead talk for a moment about something that isn’t horrifying. CHAOSSEARCH. Architecturally, they do things right. They provide a log analytics solution that separates out your storage from your compute. The data lives inside of your S3 buckets, and you can access it using APIs you’ve come to know and tolerate, through a series of containers that live next to that S3 storage. Rather than replicating massive clusters that you have to care for and feed yourself, you now get to focus on just storing data, treating it like you normally would other S3 data, not replicating it or storing it on expensive disks in triplicate, and fundamentally not having to deal with the pains of running other log analytics infrastructure. Check them out today at CHAOSSEARCH.io.



Today, I want to rant about microservices. What are microservices, you may very well ask? The broken answer to a misunderstood question. Let’s talk about what gave rise to microservices: specifically, monoliths, which are generally what predated microservices as a design pattern. And by monoliths, I mean one giant codebase, the way that grandpa used to build things. Back then, Git wasn’t a thing; Subversion and Perforce ruled the day, and everyone wore a pair of fighting trousers to work in the morning. The problem with monoliths was that it’s challenging in the extreme, culturally, to have a whole host of developers working on the same codebase. One person’s change can inadvertently break the build for the other 5,000 engineers all working on that same codebase, and with the various version control systems that were heavily in use before Git became usable by mere mortals, there weren’t a lot of workflows that made it easy to have multiple people collaborate on the same system.



So microservices, for that and a few other reasons, came to be suggested as a way of solving this problem. Breaking apart those ancient monoliths into functional microservices, where each item does one thing, began to make a lot of sense. And it solves a political problem super neatly: you want each team responsible for a given microservice. That promise is compelling because, in theory, if you build a microservice and you publish what data that service takes in, in what format it needs to be, and what it will return in response to having that data sent to it, then what your microservice does to achieve its goal, and how it works, doesn’t matter at all to anyone else. You can replace the database; you can move to serverless, or containers, or punch cards. It doesn’t really matter how it does the work, just so long as the work gets done in the way that is published. And whatever decisions you make only ever impact your own team as far as how those things get done. You don’t have to focus on collaborating with folks the way that you once did; as long as that API remains stable, the sky’s the limit. So suddenly, your 5,000 developers are no longer tightly coupled to one another and can do all kinds of things and move way faster. It’s a great model. But it’s also a problem. Why?
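
To make the published-contract idea concrete, here's a toy sketch in Python; the service and field names are invented for this example.

```python
# Toy illustration of the published-contract idea: callers see only the
# shape of the request and the response, never how the work gets done.
# The service and field names are invented for this example.
from typing import TypedDict


class PriceRequest(TypedDict):
    sku: str
    quantity: int


class PriceResponse(TypedDict):
    sku: str
    total_cents: int


def quote_price(req: PriceRequest) -> PriceResponse:
    # Swap the database, move to serverless, containers, or punch cards:
    # as long as this input/output shape holds, no other team has to care.
    return {"sku": req["sku"], "total_cents": 499 * req["quantity"]}


print(quote_price({"sku": "duck-001", "quantity": 3}))
```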



In the late 19th and early 20th centuries, democracy flourished around the world. This was good for most folks, but terrible for the log analytics industry because there was now a severe shortage of princesses to kidnap for ransom to pay for their ridiculous implementations. It doesn’t have to be that way. Consider CHAOSSEARCH. The data lives in your S3 buckets in your AWS accounts, and we know what that costs. You don’t have to deal with running massive piles of infrastructure to be able to query that log data with APIs you’ve come to know and tolerate, and they’re just good people to work with. Reach out to CHAOSSEARCH.io. And my thanks to them for sponsoring this incredibly depressing podcast.



The problem is that this works super well for places with thousands of engineers all working on one product. But then Hacker News got a hold of it, and because it’s Hacker News, it took exactly the wrong lessons from this approach and started embracing the pattern where it was patently ridiculous, rather than helpful. As a result, you see startups out there with 700 microservices, but somehow only 15 employees. Where’s the sense in that? You get massive sprawl, an awful lot of duplicate work, and, in many cases, no real internal standards, plus this misguided belief that everything should look architecturally like Google’s internal systems, despite the fact that your entire application could run on a single computer from 2012 without breaking a sweat. Even Google would argue that its internal systems should not look like Google’s internal systems, but technical debt is a thing, and the choices we make constrain what we’re able to do.

The deeper problem here is that microservices are fundamentally a technical means of solving a political problem: how to make people work together more effectively. And that somehow became a technical best practice instead. It’s not, not for everyone. It introduces complexity. It leads to scenarios where no one, absolutely no one, has all of the various moving parts in their head. It’s prime material for the Whiteboard Confessional, because when you truly embrace microservices, your whiteboards are always full of things that nobody understands. It ties into the larger problem of building things in service to your own resume, or to your engineering department, or to the practice of engineering as some abstract ideal, rather than in service to the business needs that your company exists entirely to cater to.

Plus, when you have oodles of microservices hanging around everywhere, you first have a library problem: which service does what? No one knows. And if you ever do track it down, great: each one of those services also needs to be monitored. Some monitoring vendor somewhere just wet themselves at the idea of the new boat they’re going to be able to afford, because with that many things to monitor, it gets really expensive really quickly. You also have testing problems, because testing is critical, and in most shops the testing quality between different teams’ solutions is woefully uneven. There are library problems in the literal sense, too, where people have to not just solve hard problems, but solve them the same way globally. An easy gimme example of that is time zones. How you determine time zones has to match how the other services you deal with interact with time zones, or you’re going to wind up with a giant, obnoxious problem that is very difficult to track down as time zone data changes and gets updated in one microservice, but not the other. Anyone who’s been through that will absolutely scream at you, because you’re never supposed to do that. It leaves scars that last a lifetime.
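
One common way to dodge that particular trap, for what it's worth, is to never let local time cross a service boundary at all. Here's a minimal sketch, standard library only; the zone name is just an example.

```python
# Exchange UTC timestamps at every service boundary and convert to local
# time only at the display edge, so services don't need matching tz data.
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def to_wire(ts: datetime) -> str:
    # Serialize as UTC ISO 8601 for anything that crosses the boundary.
    return ts.astimezone(timezone.utc).isoformat()


def from_wire(raw: str, display_zone: str = "America/Los_Angeles") -> datetime:
    # Parse the UTC wire format, then localize only for display.
    return datetime.fromisoformat(raw).astimezone(ZoneInfo(display_zone))


wire = to_wire(datetime.now(timezone.utc))
print(wire, "->", from_wire(wire))
```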



And the real problem that I don’t see addressed very much is that embracing the microservices model means that you’re committing to an incredibly complicated architecture, regardless of the reasons. And simplicity is always easier to scale, understand, and improve. I’m not saying that you should never use microservices. In fact, if you’re a fan of this show, you’ll notice that I virtually never say anything definitive here at all, other than use Route 53 as a database. And the reason that I’m not definitive about what you should or should not do in absolutes is because context is key. It always is. Without context, you go round and round and round attempting to solve for problems you will likely never encounter, all in the name of doing what the thought leaders tell you to do. You are not Google. You are not Twitter for Pets. You’re something else. Go do what works for you and what makes sense for your business.



I’m Corey Quinn, cloud economist. This has been the Whiteboard Confessional on the AWS Morning Brief, and I’ve enjoyed irritating positively everyone with my thoughts on microservices.



Thank you for joining us on Whiteboard Confessional. If you have terrifying ideas, please reach out to me on Twitter at @quinnypig and let me know what I should talk about next time.



Announcer: This has been a HumblePod production. Stay humble.

Fri, 17 Apr 2020 03:00:00 -0700
Goldilocks and the Three Elastic Beanstalk Consoles
AWS Morning Brief for the week of April 13, 2020.
Mon, 13 Apr 2020 03:00:00 -0700
Whiteboard Confessional: The Rise and Fall of the T-Shaped Engineer

About Corey Quinn


Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Links


Transcript


Corey Quinn: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semi-polite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best thing to call your staging environment is “theory,” because invariably whatever you’ve built works in theory, but not in production. Let’s get to it.



On this show, I talk an awful lot about architectural patterns that are horrifying. Let’s instead talk for a moment about something that isn’t horrifying. CHAOSSEARCH. Architecturally, they do things right. They provide a log analytics solution that separates out your storage from your compute. The data lives inside of your S3 buckets, and you can access it using APIs you’ve come to know and tolerate, through a series of containers that live next to that S3 storage. Rather than replicating massive clusters that you have to care for and feed yourself, you now get to focus on just storing data, treating it like you normally would other S3 data, not replicating it or storing it on expensive disks in triplicate, and fundamentally not having to deal with the pains of running other log analytics infrastructure. Check them out today at CHAOSSEARCH.io.



For a long time now, I’ve been a believer in the idea of the T-shaped engineer. And what I mean by that is that you should be broad across a wide variety of technologies, but deep in one or two very specific areas. So, it looks a bit like a T, or an inverted T, depending upon how you visualize it. I’m describing this with words; I don’t have a whiteboard in front of me. Use your imagination, you’ll be okay. The point is that whenever you’re working in a new environment, or on a new problem, having a broad base of technologies you’re aware of is incredibly useful to fall back upon. Now, the reason to be super deep in one or two areas is that specialization is generally what lets people charge more for various services. People want to hire domain-specific expertise for an awful lot of problems that they want to get solved. So, having something that you can bring into job interviews and more or less mop the floor with people asking questions around that domain is an incredibly valuable thing to have.



But that has some other consequences too. And that’s what today’s episode of The Whiteboard Confessional is about. Back in my first Unix admin job, I busily began upgrading a whole lot of the infrastructure, ripping out very early Red Hat Enterprise Linux and CentOS version 4 systems and replacing them with the one true operating system, which, of course, is FreeBSD. And I had a litany of explanations as to why it was the best option, what it could do for various problems, and why there was just absolutely no comparison between FreeBSD and anything else. I could justify it super easily, and the real defense mechanism here was that people get really, really, really tired of talking to zealots, so no one kept questioning me. They just basically said, “Fine, whatever,” and got out of the way. Years later, I decided to focus on something that wasn’t an esoteric operating system to go super deep in, and that’s right, I picked SaltStack, which is configuration management done right, tied to remote execution.



I’d worked with Puppet, I’d tolerated CFEngine, and I had a bunch of angry, loud opinions about both; SaltStack was absolutely the way and the light. So, at the company I was working for at that time, I rolled it out everywhere, and our entire provisioning and configuration management process was run through SaltStack. And I could come up with a whole litany of reasons why this was the right answer, and why nothing else was going to come close to the ideal correctness that SaltStack provided. And people eventually stopped arguing with me, because they had better things to do than argue with a zealot about which configuration management system was the right one to go with. I’ve also talked on previous episodes of the show about using ClusterSSH. And this was before I discovered the religion that was configuration management.



It was the right answer because the alternative was running a for loop with shell scripting, which was suboptimal for a wide variety of reasons, and I would explain to everyone exactly why it was suboptimal. So, again, they shrugged, got out of the way, and let me use ClusterSSH. And a similar pattern happened when I was working with large-scale storage. NetApp was the right answer for all of our enterprise storage needs because, let’s face it, it wasn’t my money. And when it comes to NFS, even today, they are head and shoulders above anything else in the space. And then eventually, it turned to AWS. And for a while, I want to say around 2014, 2015, I would tell you why AWS was the right answer for every problem. What challenge are you trying to work through? Well, AWS has an answer for that, because of course they do. Their product strategy is, “yes”. Now, what do all of those independent stories have in common? Great question. Let’s talk about that. But first…



In the late 19th and early 20th centuries, democracy flourished around the world. This was good for most folks, but terrible for the log analytics industry because there was now a severe shortage of princesses to kidnap for ransom to pay for their ridiculous implementations. It doesn’t have to be that way. Consider CHAOSSEARCH. The data lives in your S3 buckets in your AWS accounts, and we know what that costs. You don’t have to deal with running massive piles of infrastructure to be able to query that log data with APIs you’ve come to know and tolerate, and they’re just good people to work with. Reach out to CHAOSSEARCH.io. And my thanks to them for sponsoring this incredibly depressing podcast.



The problem is that everything I just mentioned was a pet technology, or a pet company: something that I had taken the time to get deep into and learn, and that therefore became my de facto answer for anything that even remotely looked like that problem. It’s like that old adage, “When the only tool you have is an axe, every problem looks like hours of fun,” regardless of whether hitting something with a piece of metal that’s sharpened on one side is actually what you’re looking to do right now. The problem with FreeBSD, for example, was that, terrific, yeah, it served an awful lot of valuable purposes, but no one else on the team knew how it worked.



So, when I left that job, they had to rip it out and replace it. And that was not easy or fun, because I was about as good at documenting back then as you’d probably think I would be. The problem with SaltStack was that, to some extent, not everyone was on board with the idea of that particular tool. So, people were trying to use Ansible for some things, Puppet for something else, and then SaltStack for everything else, and you wind up with dueling configuration management systems. ClusterSSH is wrong; it’s not the right answer for anything. And the problem there wasn’t shell scripting itself, but that I was really bad at shell scripting.



But by casting my own shortcomings as a universal truth, it very quickly became an accepted truism that I still hear about, and am still stamping out, today. NetApps are great technology, but centralized storage is not really the right answer anymore, and if it is, you’re probably looking at an object store in 2020. And when it comes to AWS, well, I’ll be honest with you, I don’t really have a strong cloud vendor preference. My default advice is: go with the one you’re currently using. I’m most familiar with AWS, but that doesn’t mean it’s the right answer or the wrong answer for your specific tool or your specific business problem.



The problem, as a result, becomes that a single person, and in all of these stories that was me, champions a technology and then steamrolls everyone else in the group until people get tired of arguing and let them do what they want. Or worse, they don’t even ask; they just go ahead and do it. And, “Surprise, we’re a FreeBSD shop now.” “Well, that doesn’t sound ideal. I don’t know if we want to be a FreeBSD shop.” “Well, yes, we’ve already deployed it to physical servers that are currently in 18 different buildings, so if you want to go rip it out and replace it with our standard Linux distribution, you go on right ahead.” It doesn’t work. And I encourage you at this point, and this is where I start winning friends, to look around your environment and see if this has happened to you.



The most obvious example of this in the modern zeitgeist is Kubernetes. Take a look at your current Kubernetes environment and ask yourself a seriously tough question: namely, how many people within your organization need to quit on the spot before you have a serious operational problem tied to the care and feeding of Kubernetes? In some very interesting shops, the answer is one. In some cases, the answer is zero, because they already have an operational problem on top of Kubernetes: an outside consultancy, or someone who is no longer there, handled the configuration and care and feeding of their environment, and working with the existing Kubernetes install has more or less become oral tradition. People don’t tend to understand it, or why things are the way that they are. And Heaven forbid it comes time to fix it.



I’m also not picking specifically on Kubernetes. Serverless has the exact same problems. I’ve wound up in a serverless rat’s nest, trying to figure out what’s going on, where, how, or why it was built, and that’s in my own environment, where I’m the only person who builds all of these things. It turns out that the worst developer I’ve ever met is me three years ago, when I have to go back and resurrect that monster’s code. It’s terrible. It’s awful. And I have to imagine that I will be saying the same thing about present-day me three years from now. I’m not saying don’t use these technologies. I’m not saying don’t enjoy the technologies that you’re using. But remember, you don’t work for the vendor in question. Most of us aren’t being paid by AWS to champion what they do. And if something else emerges beyond your chosen technology, keep an open mind. Don’t crap on things just because they’re different. Don’t be down on a technology just because it’s not what you already know; give it a fair shake, then crap on it from a position of knowing why it’s terrible.



Most of the things that I make fun of in the various snark analyses that I do are things that I think are terrible because I tried to use them, and I see where the sharp edges are. Making fun of something from a position of ignorance is both super easy and super fraught, because it’s very easy to be told you don’t actually know what you’re doing, and to find out you’ve been inadvertently making fun of a whole lot of other people’s hard work. And that’s not a great position to be in, and it’s not fair to them. So, my ask to you, as you think about this, is to remember that there are other people who also have their strong, angry, loud opinions. Do you want to be subject to what they unilaterally decide? I suspect not. A little empathy goes a long way, particularly in these times where virtually all communication is not face-to-face.



Thanks for listening. I’m Cloud Economist Corey Quinn. This is the AWS Morning Brief: Whiteboard Confessional.



Thank you for joining us on Whiteboard Confessional. If you have terrifying ideas, please reach out to me on Twitter at @quinnypig and let me know what I should talk about next time.



Announcer: This has been a HumblePod production. Stay humble.


Fri, 10 Apr 2020 03:00:00 -0700
Amazon Detective and the Case of the Giant AWS Bill
AWS Morning Brief for the week of April 6, 2020.
Mon, 06 Apr 2020 03:00:00 -0700
Whiteboard Confessional: My Metaphor-Spewing Poet Boss & Why I Don’t Like Amazon ElastiCache for Redis

About Corey Quinn

Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.


Links


Transcript

Corey Quinn: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semi-polite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best thing to call your staging environment is “theory,” because invariably whatever you’ve built works in theory, but not in production. Let’s get to it.

On this show, I talk an awful lot about architectural patterns that are horrifying. Let’s instead talk for a moment about something that isn’t horrifying. CHAOSSEARCH. Architecturally, they do things right. They provide a log analytics solution that separates out your storage from your compute. The data lives inside of your S3 buckets, and you can access it using APIs you’ve come to know and tolerate, through a series of containers that live next to that S3 storage. Rather than replicating massive clusters that you have to care for and feed yourself, you now get to focus on just storing data, treating it like you normally would other S3 data, not replicating it or storing it on expensive disks in triplicate, and fundamentally not having to deal with the pains of running other log analytics infrastructure. Check them out today at CHAOSSEARCH.io.

When you walk through an airport (assuming that people still go to airports in the state of pandemic in which we live), you’ll see billboards saying, “I love my slow database, says no one ever.” This is an ad for Redis, and the unspoken implication is that everyone loves Redis. I do not. In honor of the recent release of Global Datastore for Amazon ElastiCache for Redis, today I’d like to talk about that time ElastiCache for Redis helped cause an outage that led to drama. This was a few years back, when I worked at a B2B company (B2B, of course, meaning business-to-business; we were not dealing direct-to-consumer). I was a different person then, and it was a different time. Specifically, the time was late one Sunday evening, and my phone rang. This was atypical, because most people didn’t have that phone number. At this stage of my life, my default answer when my phone rang was, “Sorry, you have the wrong number.” If I wanted phone calls, I’d have taken out a personals ad. Even worse, when I answered the call, it was work. Because I ran the ops team, I was pretty judicious about turning off alerts for anything that wasn’t actively harming folks. If it wasn’t immediately actionable and causing trouble, then there was almost certainly an opportunity to fix it later during business hours. So, the list of things that could wake me up was pretty small. As a result, this was the first time I had been called out of hours during my tenure at this company, despite having spent over six months there at that point. So who could possibly be on the phone but my spineless coward of a boss? A man who spoke only in metaphor. We certainly weren’t social friends, because who can be friends with a person like that?

“What can I do for you?” “As the roses turn their faces to the sun, so my attention turned to a call from our CEO. There’s an incident.” My response was along the lines of, “I’m not sure what’s wrong with you, but I’m sure it’s got a long name and is incredibly expensive to fix.” Then I hung up on him and dialed into the conference bridge. It seemed that a customer had attempted to log into our website recently and had gotten an error page, and this was causing some consternation. Now, if you’re used to a B2C, or business-to-consumer, environment, that sounds a bit nutty, because you’ll potentially have millions of customers; if one person hits an error page, that’s not CEO-level engagement. One person getting that error is, sure, still not great, but it’s not the end of the world. I mean, Netflix doesn’t have an all-hands-on-deck disaster meeting when one person has to restart a stream. In our case, though, we didn’t have millions of customers. We had about five, and they were all very large businesses. So, when they said jump, we were already mid-air. I’m going to skip past the rest of that phone call and the evening, because it’s much more instructive to talk about this with the clarity lent by the sober light of day the following morning, and the post mortem meeting that resulted from it. So, let’s talk about that. After this message from our sponsor.



In the late 19th and early 20th centuries, democracy flourished around the world. This was good for most folks, but terrible for the log analytics industry because there was now a severe shortage of princesses to kidnap for ransom to pay for their ridiculous implementations. It doesn’t have to be that way. Consider CHAOSSEARCH. The data lives in your S3 buckets in your AWS accounts, and we know what that costs. You don’t have to deal with running massive piles of infrastructure to be able to query that log data with APIs you’ve come to know and tolerate, and they’re just good people to work with. Reach out to CHAOSSEARCH.io. And my thanks to them for sponsoring this incredibly depressing podcast.



So, in hindsight, what happened makes sense, but at the time, when you’re going through an incident, everything’s cloudy and you’re getting conflicting information, and it’s challenging to figure out exactly what the heck happened. As it turns out, there were several contributing factors, specifically four of them. And here’s the gist of what those four were.



Number one, we used Amazon ElastiCache for Redis. Really, we were kind of asking for trouble. Two, as tends to happen with managed services like this, there was a maintenance event that Amazon emailed us about. Given that we weren’t completely irresponsible, we braved the deluge of marketing to that email address; I’d caught this and scheduled it in the maintenance calendar. In fact, we were specifically allowed to schedule when that maintenance took place, so we scheduled it for a weekend. In hindsight: mistake. When maintenance like this happens, you want to make sure it takes place when there are people around to keep an eye on things.



Three, the maintenance was supposed to be invisible. The way that Amazon ElastiCache for Redis works is that you have clusters, each with a primary and a replica. The way that they do maintenance is they update the replica half of the cluster, then fail the cluster over so the replica gets promoted to primary, then update the old primary, which then hangs out as the replica. This had happened successfully, or so we thought, the day before on Saturday, a full day before our customer got the error page that started this exercise. What had really happened was that we’d misconfigured the application to point to the actual primary cluster member rather than the floating endpoint that always redirects to the current primary within that cluster. So, when the maintenance hit, and the primary became the replica, we were suddenly, unknowingly, having the application talk to an instance that was read-only. It would still work for anything that was read-based; it wasn’t until it tried to write something that all kinds of problems arose.
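
To make the failure mode concrete, here's a sketch using the redis-py client; both hostnames are hypothetical, and the endpoint choice is the entire point.

```python
# A sketch of the misconfiguration described, using the redis-py client.
# Both hostnames are hypothetical; the pattern is what matters.
import redis

# Wrong: pinned to one specific cluster member. After a maintenance
# failover, this node gets demoted to a read-only replica, and every
# write starts failing while reads keep quietly succeeding.
r_bad = redis.Redis(host="my-cache-001.example.cache.amazonaws.com", port=6379)

# Right: the floating primary endpoint, which ElastiCache repoints at
# whichever node is currently the primary.
r_good = redis.Redis(host="my-cache.example.cache.amazonaws.com", port=6379)

r_good.set("session:42", "active")  # writes follow the failover
print(r_good.get("session:42"))
```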



And that led to contributing factor four. Because reads still worked, our standard monitoring didn’t pick this up. We didn’t have a synthetic test that simulated a login. As a result, the first indication that something was even slightly amiss showed up in the logs when the customer got that failed page, 15 minutes before my metaphor-spewing poet boss called me. So, when explaining this to the business stakeholders during the post mortem, we got to educate them in the art of blamelessness, which you’d think would be a terrific opportunity for someone whose only real skill is spewing metaphor, but of course, he didn’t step up to that plate. Again, terrible boss. So, someone from the product org was sitting there saying, “What you’re telling me is that someone on your team misconfigured—” Okay, slow down, Hasty Pudding. We’re not blaming anyone for this. There were contributing factors, not a root cause, and this is fundamentally a learning opportunity with a lot of areas for improvement. “Okay, so some unnamed engineer screwed up and—” And we went round and round. Normally, an effective boss would have stepped in here, but remember, he only spoke in metaphor; defending his staff wasn’t speaking in metaphor, so he, of course, chose to remain silent. As it turns out, and as anyone who knows me can attest, I have a few different skills, but a skill that I’m terrible at is shutting up. It turns out that blameless post mortems are only blameless if you have that culture driven from the top, because everyone roundly agreed at the end of that meeting that the way it devolved was certainly my fault.



Now, let’s be clear. This was my team, responsible for the care, feeding, and configuration of this application, and therefore it was my responsibility. Who had misconfigured it is not the relevant part of the story, and even now, I still maintain that it’s not. There were a number of mistakes made across the board, but the buck does stop with me. And there was a chain of events that led to this outage. Our monitoring was insufficient for something this sensitive; an error like that in the logs should have paged me before a walking metaphor called me manually. We should have been testing that whole login flow with synthetic tests, and we should ideally have caught the misconfiguration of pointing the application to the cluster member rather than the cluster itself. But really, the biggest mistake we made across the board was almost certainly using Amazon ElastiCache for Redis. How using something else would have avoided this, I couldn’t possibly begin to say, but when in doubt, as is always a best practice, blame Amazon.
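
For what it's worth, a synthetic login check doesn't have to be fancy. Here's a minimal sketch using the requests library; the URL, credentials, and alerting are all hypothetical stand-ins.

```python
# A minimal synthetic login probe of the sort that would have caught this
# a day earlier. Uses the requests library; the URL, credentials, and
# alerting are all hypothetical stand-ins.
import requests


def check_login(base_url: str = "https://app.example.com") -> bool:
    session = requests.Session()
    resp = session.post(
        f"{base_url}/login",
        data={"username": "synthetic-canary", "password": "not-a-real-secret"},
        timeout=10,
    )
    # Logging in exercises the write path (session creation), so a
    # read-only datastore fails here even while plain page loads succeed.
    return resp.ok


if __name__ == "__main__":
    if not check_login():
        print("synthetic login failed; page a human before the CEO calls")
```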

This has been another episode of the Whiteboard Confessional. I hope you’ve enjoyed this podcast. If so, please leave a five-star review on Apple Podcasts. If you’ve hated this podcast, please leave a five-star review on Apple Podcasts while remembering to check your Redis cluster endpoints.



Announcer: Thank you for joining us on Whiteboard Confessional. If you have terrifying ideas, please reach out to me on Twitter at @quinnypig and let me know what I should talk about next time.

Announcer: This has been a HumblePod production. Stay humble.

Fri, 03 Apr 2020 03:00:00 -0700
The "AWS For God's Sake Leave Me Alone" Service
AWS Morning Brief for the week of March 30, 2020.
Mon, 30 Mar 2020 03:00:00 -0700
Whiteboard Confessional: Console Recorder: The Thing AWS Should Have Built

About Corey Quinn

Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.


Links


Transcript

Corey Quinn: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semi-polite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best thing to call your staging environment is “theory,” because invariably whatever you’ve built works in theory, but not in production. Let’s get to it.

On this show, I talk an awful lot about architectural patterns that are horrifying. Let’s instead talk for a moment about something that isn’t horrifying. CHAOSSEARCH. Architecturally, they do things right. They provide a log analytics solution that separates out your storage from your compute. The data lives inside of your S3 buckets, and you can access it using APIs you’ve come to know and tolerate, through a series of containers that live next to that S3 storage. Rather than replicating massive clusters that you have to care for and feed yourself, you now get to focus on just storing data, treating it like you normally would other S3 data, not replicating it or storing it on expensive disks in triplicate, and fundamentally not having to deal with the pains of running other log analytics infrastructure. Check them out today at CHAOSSEARCH.io.

You’ll notice that I periodically refer to various friends of the show as code terrorists. It’s never been explained, until now, exactly what that means and why I use that term. So, today, I thought I’d go ahead and shine a little bit of light on that. One of the folks that I refer to most frequently is Ian Mckay, a fine gentleman based out of Australia. And he’s built something that is both amazing and terrible at the same time, called Console Recorder. But let me back up before I dive into this monstrosity. Let’s talk about how we get the things that we build in AWS into production. There are basically four tiers. Tier one is using the AWS web console itself: we click around and we build things. Great. Tier two is using CloudFormation like sensible folks. Tier three is Terraform, with all of its various attendant tooling. And then there’s the ultra tier four that I practice, which is: we use the AWS console, and then we lie about it. Some folks are gonna pipe up here and say that, oh, you should use the CDK, or something like that, or custom scripts that wind up spinning up production.

And all of those are well and good, but only recently did CloudFormation release the ability to import existing resources. And even then, much like Terraform import, it’s pretty gnarly and not at all terrific. So, what do you generally wind up doing? Well, if you’re like me, you’ll stand up production resources inside of an AWS account. You will click around in the console. (I always start with the console, not because I don’t know how these other tools work, but because it helps me get a sense for how these services are imagined by the teams building them. They tend to assume that everyone who interacts with them is going to go through the console at some point, or at least that they should.) Then, once you’ve built something up, it often finds its way into production if you’re at all like me: I’ll spin something up just to test something, and it works, oh my stars, and suddenly you just want to ship it and not worry about it, so you don’t go back and rebuild it properly.

So, now you’re left with this hand-built thing that’s just flapping around out there in production. What are you supposed to do? Well, according to the AWS experts, if we’re allowed to use that term to describe them, you’re supposed to smack yourself on the forehead, be convinced that you’re fundamentally missing the boat here, throw away everything you’ve just built, and go back and do it properly. Which, admittedly, seems a little bit on the nose for those of us who’ve done exactly this far more times over the course of our careers than we would care to count. Today, however, I posit that there’s an alternate approach that doesn’t require support from AWS, which, to be honest, seems to have long ago given up on solving this particular problem in a way that human beings can understand. And I’d like to tell you about that, after this brief message from our sponsor.



In the late 19th and early 20th centuries, democracy flourished around the world. This was good for most folks, but terrible for the log analytics industry because there was now a severe shortage of princesses to kidnap for ransom to pay for their ridiculous implementations. It doesn’t have to be that way. Consider CHAOSSEARCH. The data lives in your S3 buckets in your AWS accounts, and we know what that costs. You don’t have to deal with running massive piles of infrastructure to be able to query that log data with APIs you’ve come to know and tolerate, and they’re just good people to work with. Reach out to CHAOSSEARCH.io. And my thanks to them for sponsoring this incredibly depressing podcast.



This brings us back to where we started this conversation: defining what a code terrorist is, and pointing out Ian Mckay, out of Australia, who’s built something absolutely terrifying called Console Recorder for AWS. This is a browser extension that works in Chrome and in Firefox, possibly others; I stopped looking after those two. And it does exactly what it says on the tin: you click a button in your browser when you start doing something inside the AWS console, and it starts recording what you’re doing. The icon turns into a bright red record symbol, and you do whatever it is you’re going to do to spin something up. When you’re done, you hit the button again to stop recording, and it opens a brand new tab.

Inside of this tab lies wonder and magic that I am thrilled to be able to share with you. It’s terrifying, I mean, he is a code terrorist, but it is transformative. Specifically, it winds up spitting out the programmatic way of doing whatever the hell it is that you just did, but it spits that out in Boto 3, which is the Python SDK, it spits it out in CloudFormation, in Terraform, in Troposphere, which is a Python library built on top of CloudFormation—don’t get me started, it has its own challenges—CDK, for folks who can’t stop talking about the CDK long enough to build anything. And for those of us who tend to use our bash scripts, that’s right, it’s also got support for the AWS CLI. All of these are within sub-tabs within this page. And in all of them, it uses that specific language to explain what syntax and what commands you use to spit out those resources correctly, not at the high level of, oh, just do this magic. You could rerun these commands and spit out exactly what you had just done in the console. Oh, it also supports the JavaScript SDK for this. And lastly, the Go SDK, which is, in parentheses, version one. Now understand that Go is, of course, a project out of Google, so there’s always two versions of it. The version that they’re working on, and the version that is already deprecated.

So, you wind up with a whole bunch of different options for expressing the configuration of the things you just clicked through inside the console. But wait, there’s one more tab I want to talk to you about: IAM. Specifically, it defines very granular permissions for what you just did. For example, just before I recorded this, I used it to spin up some S3 buckets, and it gave me very specific, granular permissions: ListBucket, only on the bucket that I created. It does have some duplication, because sometimes the console makes the same call several times, so you’re going to want to edit this before turning it into something real, but it’s a matter of taking things away, not of putting them back in. It’s always easier to cut than it is to add when it comes to overly wordy syntactic boilerplate. So, the reason that I’m bringing this up on the Whiteboard Confessional is that this is the sort of thing that, A) should not have to exist, but it does, and B) is something that AWS itself should have built a long time ago. It’s catching every API call you’re making whenever you build anything through the console, or through some weird script, or whatever tool you care to use; there is no reason in the world that AWS shouldn’t be able to capture those things and then tell you, oh, here’s how you would do that programmatically.
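
For a sense of what that output looks like, here's a hypothetical example of the kind of narrowly scoped statement described, expressed as a Python dict; the bucket name is made up.

```python
# A hypothetical example of the kind of narrowly scoped policy the
# extension emits: ListBucket on one specific bucket, nothing broader.
# The bucket name is made up.
import json

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::my-example-bucket",
        }
    ],
}

# Duplicate statements from repeated console calls get pruned by hand;
# as noted above, it's easier to cut than to add.
print(json.dumps(policy, indent=2))
```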

I don’t mean to turn this into a cloud comparison story, but GCP already does this in the console: hey, that thing you just did, here’s the command-line interface if you were to do that right here in Cloud Shell in the browser. It’s incredibly helpful to be able to take the thing you just clicked your way through and turn it into a command-line story, one that otherwise involves a lot of different fuzzy parameters and different API syntaxes, because of course there’s no API consistency here. That was a bit of a long story there, wasn’t it? And see, it really does show the, I guess, depressing lengths that we’ll go to in order to, A) build things, and B) then replicate them in a borderline production-like environment. I should only ever be reaching for this for the quick one-off things I build as experiments, but I find myself using it more and more and more. That’s kind of a problem, because whenever I do something, I’m usually not the only person doing it (unless we’re talking about Route 53 as the database), and it tells me that getting to infrastructure as code is still too heavy a lift for an awful lot of folks out there.

This is not an extension that I dreamed up, and I wasn’t the only person that Ian told; I heard about it through the community. The fact that it has to exist speaks to the relatively immature state of infrastructure as code for production use in environments that may not have started out that way. It seems like AWS and the industry at large divide people into two buckets: one, uncultured swine like me who build things in the console; and two, very large companies that already have a robust CI/CD pipeline, with all of the tooling needed to build these things safely and sanely in production. In practice, most folks are somewhere between those two extremes. And the fact that it took a random third party building something like this before CloudFormation import even existed tells me that this is a very clear pain for a lot of people. I talked to Ian on the Screaming in the Cloud podcast. It turns out that building this entire monstrosity took him a couple of months of nights-and-weekends work, and he did it by himself. And it’s disturbingly well polished for something like that. If he can build this in his spare time, why have we been asking for, and not getting, this for the entirety of AWS’s lifetime as a cloud services provider? That’s my rant. That’s my position on this. And frankly, I’m ashamed to admit that I love and use this extension. Take a look. You can Google Console Recorder for AWS; the source is on GitHub. You need not trust Ian or me on this. Give it a try. For some of you out there, it’ll be transformative.



Thank you for listening to the AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn, and I’ll talk to you next week.



Announcer: Thank you for joining us on Whiteboard Confessional. If you have terrifying ideas, please reach out to me on Twitter at @quinnypig and let me know what I should talk about next time.



Announcer: This has been a HumblePod production. Stay humble.

Fri, 27 Mar 2020 03:00:00 -0700
Watch Your Bill or They'll CloudWatch It For You
AWS Morning Brief for the week of March 23, 2020.
Mon, 23 Mar 2020 03:00:00 -0700
Whiteboard Confessional: Configuration MisManagement

About Corey Quinn

Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.


Show Notes


Transcript

Corey Quinn: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semi-polite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best thing to call your staging environment is “theory,” because invariably whatever you’ve built works in theory, but not in production. Let’s get to it.


On this show, I talk an awful lot about architectural patterns that are horrifying. Let’s instead talk for a moment about something that isn’t horrifying. CHAOSSEARCH. Architecturally, they do things right. They provide a log analytics solution that separates out your storage from your compute. The data lives inside of your S3 buckets, and you can access it using APIs you’ve come to know and tolerate, through a series of containers that live next to that S3 storage. Rather than replicating massive clusters that you have to care for and feed yourself, you now get to focus on just storing data, treating it like you normally would other S3 data, not replicating it or storing it on expensive disks in triplicate, and fundamentally not having to deal with the pains of running other log analytics infrastructure. Check them out today at CHAOSSEARCH.io.


Historically, many best practices were, in fact, best practices. But over time, the way that we engage with systems changes, the problems that we’re trying to solve start resembling other problems, and, at some point, entire industries shift. So, what you should have been doing five years ago is not necessarily what you should be doing today. Today, I’d like to talk not about one or two edge-case problems, as I have in previous editions of the Whiteboard Confessional, but rather about an overall pattern that’s shifted. And that shift has been surprisingly sudden, yet gradual enough that you may not entirely have noticed. This goes back to, let’s say, 2012, 2013, and is in some ways the story of how I learned to speak publicly. So this is, indirectly, one of the origin stories of me as a podcaster, and of my ongoing love affair with the sound of my own voice. I was one of the very early developers behind SaltStack. Salt, for those who are unfamiliar, is a remote execution framework slash configuration management system, and it’s what got me participating in code development. It turns out that when you have a pattern of merging every random pull request that some jackass winds up submitting, and then immediately submitting a follow-up pull request that fixes everything you just merged in, it’s, first, not the most scalable thing in the world, but on balance it provides such a wonderful, welcoming community that people become addicted to participating in it. And SaltStack nailed this in the early days.


Now, before this, I’d been focusing on configuration management in a variety of different ways. Some of the very early answers for this were CFEngine, which was written by an academic and is exactly what you would expect an academic to write. It feels more theoretical than something you would want to use in production. But okay. Bcfg2 was something else in this space, and the fact that that is its name tells you everything you need to know about how user-friendly it was. And then the world shifted. We saw Puppet and Chef both arise. You can argue which came first; I don’t care enough in 2020 to have that conversation. But they promoted a model of a domain-specific language, in Puppet’s case, versus Chef, where they decided, “All right, great, we’re gonna build this stuff out in Ruby.” From there, we then saw a further evolution with Ansible and SaltStack, which really round out the top four. Now, all of these systems fundamentally do the same thing, which is: how do we wind up making the current state of a given system look like it should? That means, how do you make sure that certain packages are installed across all of your fleet? How do you make sure that the right users exist across your entire fleet? How do you guarantee that there are files in place that have the right contents? And when the contents of those files change, how do you restart services? Effectively, how do you run arbitrary commands and converge the state of a remote system so it looks like it should? Because trying to manage systems at scale by hand is awful.
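To make “converge the state of a remote system” concrete, here’s a minimal sketch of the idea in plain Python rather than any particular tool’s DSL; the config path, file contents, and service name are all hypothetical, and the restart command assumes a systemd box:

    import subprocess
    from pathlib import Path

    def ensure_file(path: str, contents: str) -> bool:
        """Write the file only if it differs; returns True when something changed."""
        p = Path(path)
        if p.exists() and p.read_text() == contents:
            return False  # already converged, nothing to do
        p.write_text(contents)
        return True

    def restart_service(name: str) -> None:
        # Real tools abstract over init systems; this assumes systemd.
        subprocess.run(["systemctl", "restart", name], check=True)

    # The convergence rule: only restart the service if its config changed.
    if ensure_file("/etc/myapp/app.conf", "listen_port = 8080\n"):
        restart_service("myapp")

Run it once, and the file gets written and the service restarts; run it again, and nothing happens, because the system already matches the desired state. That compare-then-act loop is the conceptual core of Puppet, Chef, Ansible, and Salt alike.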


You heard in a previous week what happened when I tried to run this sort of system by using a distributed SSH client. Yeah, it turns out that mistakes are huge and hard to fix. This speaks toward the direction of moving to cattle instead of pets when it comes to managing systems. And all of these systems more or less took a different approach to it, and some were more aligned with how I saw the world than others were. So I started speaking about SaltStack back in 2012 and giving conference talks. The secret to giving a good conference talk, of course, is to give a whole bunch of really terrible ones first, and woo boy, were these awful. I would put documentation on the slides. I would then read the documentation to people, frantically trying to teach folks the ins and outs of a technical system in 45 minutes or less. It was about as engaging as it probably sounds. Over time, I learned not to do that, but because no one else was speaking about SaltStack, I was in the rarefied position of being able to tell a story, and learn to tell stories, about a platform that I was passionate about, as it engaged a larger and larger community. Now, why am I talking about all of this on the Whiteboard Confessional? Excellent question.


But first, in the late 19th and early 20th centuries, democracy flourished around the world. This was good for most folks, but terrible for the log analytics industry because there was now a severe shortage of princesses to kidnap for ransom to pay for their ridiculous implementations. It doesn’t have to be that way. Consider CHAOSSEARCH. The data lives in your S3 buckets in your AWS accounts, and we know what that costs. You don’t have to deal with running massive piles of infrastructure to be able to query that log data with APIs you’ve come to know and tolerate, and they’re just good people to work with. Reach out to CHAOSSEARCH.io. And my thanks to them for sponsoring this incredibly depressing podcast.


The reason I bring up configuration management across the board is not because I want to talk about the pattern of doing terrible things within it, and oh, there are terrible patterns, but rather to talk about how, in 2020, it is almost certainly an anti-pattern if you’re using it at all. Let me explain. The idea of configuration management was how you wind up getting a bunch of systems converged to a common state. Well, in the world of immutable infrastructure, in no small part ushered in by Docker, and later other systems like it, suddenly spinning up a new container took less than a second, and it was exactly as you’d want it to be. So one-line configuration changes didn’t require redeploying fleets of servers, just iterating forward to the next version of the container. That, in turn, became a much better approach, and then you could get into a world where there wasn’t even an SSH daemon running, so people could not get into these containers in the first place to muck around and break the config, and that mucking around was a big piece of why you had to have configuration management running continuously.


And over time, as we saw this pattern continue to emerge, the number of systems and use cases for which having something stateful managing all of this became fewer and fewer. Now, with the rise of cloud providers being what they are, you wind up having a slightly different problem. If I’m trying to provision a whole bunch of EC2 instances to do things, well, first, I can change the AMI, and we’ve been able to do that for nearly 15 years, but config management was still a great plan back then. Now, with things like auto-scaling groups, and with containers running on top of these systems, you’re getting into a place where you’re using something like Chef for the initial configuration of an instance, maybe, at best, but you’re not going to be running it in a persistent way where it’s constantly checking in with a central control system anymore. In fact, an awful lot of the things that Chef historically would do are now being handled by user data, where you give a system a script to run the first time that it boots. Once that script completes successfully, it reports into a load balancer or whatnot as being ready to go. But even that is something of an anti-pattern now, because historically, we had the problem of, “Okay, I’m going to have my golden image, my AMI that spins up everything that I’d want, then I’m going to hand off to Chef.” So you would see these instances being spun up, and then it would take 20 minutes or so for them to run through all the various Chef recipes and pass the health checks.
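As a rough sketch of that user-data pattern, here’s what handing a bootstrap script to an instance looks like with boto3; the AMI ID and the script contents are placeholders, and boto3 base64-encodes UserData for you:

    import boto3

    ec2 = boto3.client("ec2")

    # Hypothetical bootstrap script; it runs once, on first boot.
    user_data = """#!/bin/bash
    yum install -y nginx
    systemctl enable --now nginx
    """

    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # hypothetical AMI
        InstanceType="t3.micro",
        MinCount=1,
        MaxCount=1,
        UserData=user_data,
    )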


That’s great, but if you need that instance to be ready sooner than that, that’s not really a great pattern. That, in turn, has led to people who specialize in the configuration management space feeling the ground beneath their feet start to erode. If you look at any seriously scaled-out system, there’s a bit of that in there, but most of the stuff that matters, most of the things that break systems at scale, are being run through things like Terraform instead, where you’re provisioning exactly what needs to be there up front, letting your cloud provider get the right AMI or baseline image into production, and from there, everything else is being handled by things like etcd, or by which version of a container they’re running to serve a particular service. Now, I don’t agree with all of the steps that people have taken down this path; in fact, I make fun of a lot of them on this and other shows. But the problem these things solve has not gone away. It has become much smaller, which means that, eventually, if you’re specializing in configuration management, as I once did, you’re going to take a look around and realize that this is not a growth industry in any meaningful sense for these vendors, who generally tend to charge on a per-host basis. Sure, there are existing large-scale environments that are continuing to expand their usage of these services, but there isn’t nearly as much net new going on, which means that we’re starting to see what looks like an industry in decline.


One of my common career tips is to imagine that I’m going to start a company tomorrow, and in the next 10 years, it’s going to grow from a tiny startup to a component of the S&P 500. From a tiny company to a giant company. At what point during that evolution over the next 10 years does that tiny startup become a customer of a particular tool or a particular market segment? If the answer is, it probably doesn’t, then whatever is serving that industry niche is almost certainly not going to survive longer term in its current state. Sure, there’s a giant long tail of existing big-E Enterprise offerings out there that are going to address that in a bunch of interesting and exciting ways, but that’s not something that I would necessarily hang my hat on as a technologist trying to focus on something that is going to be relevant for the next five to 10 years. I, instead, would prefer to focus on something that’s growing. That means today: cloud services? Absolutely. Serverless? Most definitely. Kubernetes? Unfortunately. Docker? Well, sort of. It has largely stopped being its own independent skill set. It has started to slip below the surface of awareness of what people care about, in the same way that almost everything else we talk about on this show eventually will as well, leaving it in the realm of a few highly advanced specialists. And that, in short form, is what happened to my once passionate love affair with configuration management, now turned into something I basically don’t even bother to ask people about, because I can relatively safely assume that the answer is not nearly as relevant to their technical success as it once was.


For the Whiteboard Confessional sub-series of the AWS Morning Brief, I’m Cloud Economist Corey Quinn, and I’ll talk to you next week.


Thank you for joining us on Whiteboard Confessional. If you have terrifying ideas, please reach out to me on Twitter at @quinnypig, and let me know what I should talk about next time.


Announcer: This has been a HumblePod production. Stay humble.

Fri, 20 Mar 2020 03:00:00 -0700
The Saddest Kubernetes Hanukkah
AWS Morning Brief for the week of March 16, 2020.
Mon, 16 Mar 2020 03:00:00 -0700
Whiteboard Confessional: Everything's a Database Except SQLite

About Corey Quinn

Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.


Links


Transcript


Corey Quinn: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semipolite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best name to call your staging environment is “theory”. Because invariably, whatever you’ve built works in theory, but not in production. Let’s get to it.

On this show, I talk an awful lot about architectural patterns that are horrifying. Let’s instead talk for a moment about something that isn’t horrifying. CHAOSSEARCH. Architecturally, they do things right. They provide a log analytics solution that separates out your storage from your compute. The data lives inside of your S3 buckets, and you can access it using APIs you’ve come to know and tolerate, through a series of containers that live next to that S3 storage. Rather than replicating massive clusters that you have to care and feed for yourself, instead, you now get to focus on just storing data, treating it like you normally would other S3 data and not replicating it, storing it on expensive disks in triplicate, and fundamentally not having to deal with the pains of running other log analytics infrastructure. Check them out today at CHAOSSEARCH.io.


Many things make fine databases: things that replicate data from one place to another, that take various bits of data and put them where they need to go. Other things do not make fine databases, and let’s talk about one of those today. For those who have never had the dubious pleasure of working with it, SQLite is a C library that implements a relational database engine. And it’s pretty awesome. It’s very clearly not designed to work in a client-server fashion, but rather to be embedded into existing programs for local use. In practice, that means that if you’re running SQLite, that’s S-Q-L-I-T-E, your database backend is going to be a flat file, or something very much like one, that lives locally.
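For anyone who hasn’t touched it, here’s roughly what “embedded” means in practice, using Python’s bundled sqlite3 module; the file name and schema are just for illustration:

    import sqlite3

    # There's no server to connect to: the "database" is a local file,
    # and the engine is a library running inside your own process.
    conn = sqlite3.connect("app.db")
    conn.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO users (name) VALUES (?)", ("corey",))
    conn.commit()
    print(conn.execute("SELECT name FROM users").fetchall())
    conn.close()

Which is wonderful for a phone app or a CLI tool, and exactly why “replicate it across three data centers” should set off alarm bells.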


This is technology used all over the place: in mobile apps, in embedded systems, in web apps for some very specific things. But that’s not quite the point. I once worked somewhere that decided to build a replicated environment that was active, active, active across three distinct data centers. You would really hope that that statement was a non sequitur. It’s not. If you were to picture Hacker News coming to life as a person, and that person decided to design a replication model for a database from first principles, you would be pretty close to what I have seen. You can get a replicated model that runs on top of SQLite to work, but because there’s no concept of client-server, as mentioned, the only way to handle it is to kick all of the replication and state logic out of the database layer, where it belongs, up into the application code itself, where it most assuredly does not belong. The downside of this, well, there are many downsides, but let’s start with a big one: this is not even slightly what SQLite was designed to do at all.


However, take a startup that decides that if there’s one core competency they have, it’s knowing better than everyone else; this is that story. Now, I am obviously not a developer, and I’m certainly not a database administrator. I was an ops person, which means that a lot of the joy of various development decisions fell to whatever group I happened to be in at that point in time. It turns out that when you run replicated SQLite as a database, you have to get around an awful lot of architectural pain points by babying this thing something fierce. There are a number of operational problems that going down a path like this will expose. Let me explain what some of them look like, after this.


In the late 19th and early 20th centuries, democracy flourished around the world. This was good for most folks, but terrible for the log analytics industry because there was now a severe shortage of princesses to kidnap for ransom to pay for their ridiculous implementations. It doesn’t have to be that way. Consider CHAOSSEARCH. The data lives in your S3 buckets in your AWS accounts, and we know what that costs. You don’t have to deal with running massive piles of infrastructure to be able to query that log data with APIs you’ve come to know and tolerate, and they’re just good people to work with. Reach out to CHAOSSEARCH.io. And my thanks to them for sponsoring this incredibly depressing podcast.


I’m not going to engage in a point-by-point teardown of this replicated-SQLite-as-primary-datastore Eldritch horror. My favorite database personally remains Route 53, and even that’s a better plan than this monstrosity. I’m not going to tackle, point by point, everything that made this horrifying thing so awful to deal with once it came to life. Anyone who runs this at any sort of scale for more than a week is going to discover a lot of these problems on their own. But I am going to cherry-pick a few things that were problematic about it. Remember back in the days of Windows, when things would get slow and crappy, and you had to basically restart your machine while the disk defragmented forever? Yeah, it turns out that most database systems have the same problem. The difference is that reasonable, adult-level database systems, built by human beings who are used to how this stuff works, tend to put that under the hood, so you don’t really have to think about it.


With SQLite, it wasn’t really designed for this sort of use case, so you get to wind up playing these games yourself, which is just an absolute pleasure and a joy, except the exact opposite of that. Which means that every node periodically had to be taken down in a rotation, after, in our case, about a week or so, or it would start chewing disk, it would take forever to start returning results to some queries, and the performance of the entire site would slam to a halt. So, you have to make people aware that this exists. When we first discovered that, it was fun. The problem here is that what you’re doing speaks to a larger problematic pattern. Namely, you’re forcing what has historically been a low-level function that even most operations people don’t need to know or care about into something that is now at the forefront of every developer’s mental model of the application. And if they forget that this is one of the things that has to happen, woe be unto them. Further, it should be pretty freakin’ obvious by now, from everything I’ve described about this monstrosity, that this company’s core competency, the business problem it was solving, was not building database engines. They were a classic CRUD app that solved line-of-business problems.


This is a perfect story for a traditional relational database. Why on earth would you need to reinvent an entire database engine to solve that one relatively solved business problem? A sensible person would surmise that you, in fact, do not need to do such a thing. This was not a decision that was made by sensible people. So, assume that, at this point, you have gone way off the rails here. You have crossed the Rubicon, you are off the track, and you’ve built such a thing. Now assume that you’ve run into edge cases running it. Now, let me be clear: if you choose such an architecture, your entire life is going to be edge cases, if for no other reason than that this is almost certainly not the only poor decision you’ve made. But assuming that it is, every problem you hit is going to be an exercise in frustration. You don’t get to take advantage of the community effects of virtually every other datastore option on the planet, where you can post on various Slack teams, on Twitter, on forums, on GitHub, etcetera. If you try that with something like this piece of nonsense, the answer is going to be a screaming, “What the hell have you built?” in response. At which point you are oh, so very much on your own.


Now, you might think that this episode is just me dunking on a previous crappy employer and an internal system that is never going to make it into the light of day anywhere else. Well, fun coda to this story: they open-sourced this monstrosity. You can go and look at all of this code if you know where to look. And no, I’m not going to tell you. You can find it on your own if you need this nonsense. It is 10,000 lines of C code, written on top of the SQLite library. When this was announced on Hacker News, Hacker News found it too Hacker News for their own liking and tore it to pieces in the comments. The authors of SQLite itself took one look, immediately renounced God, and went to go live lives of repentance away from the rest of humanity, which is a shame, because none of this is their fault. But it does go to show that whatever wonderful thing you build and release into the world, someone will take it and turn it into something that has no business existing on God’s green earth. If you really care about what shop this came out of, you can find it if you look. I am not going to name and shame a startup. They are not a giant public multinational like Google, or AWS, or Oracle, so I don’t feel right dragging their name in public. The service that they built is awesome. Their architectural decisions and their team culture, honestly, were both terrible. I’ll let them out themselves should they choose to do so, but that’s not the point of this. The point of this episode is that there are oh so many worse things to use as a database than Route 53. Thank you for listening to the Whiteboard Confessional. At least this time, it wasn’t entirely my fault.


Announcer: This has been a HumblePod production.


Stay humble.

Fri, 13 Mar 2020 03:00:00 -0700
Nothing’s Certain but Death and Distinguished Engineers
AWS Morning Brief for the week of March 9, 2020.
Mon, 09 Mar 2020 03:00:00 -0700
Whiteboard Confessional: Scaling Databases in a Single Bound

About Corey Quinn

Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Links


Transcript


Corey: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semipolite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best name to call your staging environment is “theory”. Because invariably, whatever you’ve built works in theory, but not in production. Let’s get to it.


But first… On this show, I talk an awful lot about architectural patterns that are horrifying. Let’s instead talk for a moment about something that isn’t horrifying. CHAOSSEARCH. Architecturally, they do things right. They provide a log analytics solution that separates out your storage from your compute. The data lives inside of your S3 buckets, and you can access it using APIs you’ve come to know and tolerate, through a series of containers that live next to that S3 storage. Rather than replicating massive clusters that you have to care and feed for yourself, instead, you now get to focus on just storing data, treating it like you normally would other S3 data and not replicating it, storing it on expensive disks in triplicate, and fundamentally not having to deal with the pains of running other log analytics infrastructure. Check them out today at CHAOSSEARCH.io.


So I’m going to deviate slightly from the format that I’ve established so far in these Friday morning Whiteboard Confessional stories, and talk instead about a pattern that has tripped me and others up more times than I care to remember. It’s my naive hope that by venting about this for the next 10 minutes or so, I will eventually encounter an environment where someone hasn’t made this particular mistake. And what mistake am I talking about? Well, as with so many terrifying architectural patterns, it goes back to databases. You decide that you’re going to write a small toy application. You’re probably not going to turn this into anything massive, and in all honesty, baby seals will probably get more hits than whatever application you’re about to build will. So you don’t really think too hard about what your database structure is going to look like. You spin up a database, you define the database endpoint inside the application, and you go about your merry way. Now, that’s great. Everything’s relatively happy, and everything we just described will work. But let’s say that you hit that edge or corner case where this app doesn’t fade away into obscurity. In fact, this thing turns out to have some legs. The thing that you’re building has now attained business viability, or is at least seeing enough user traffic that it has to worry about load.


So you start taking a look at this application, because six to eight months later you get the worst possible bug report: it’s slow. Where do you start looking when something is slow? Well, personally, I start looking at the bar, because that is a terribly obnoxious problem to have to troubleshoot. There are so many different ways that latency can get injected into an application. You discover the person reporting the slowness is on the other side of the world with a satellite internet connection that they’re apparently trying to set up to the satellite with a tin can and a piece of very long string. There are a lot of failure states here that you get to start hunting down. The joys of latency hunting. But in many cases, the answer is going to come down to: oh, that database that you defined is no longer up to the task. You’re starting to bottleneck on that database. Now, you can generally buy your way out of this problem by scaling up whatever database you’re using. Terrific, great. It turns out that you can just add more hardware, which in a time of cloud, of course, just means more money and a bit of downtime while you scale the thing up, and that gets you a little bit further down the road. Until the cycle begins to rinse and repeat, and it turns out that instances only get so large; eventually there’s no bigger box available to power your database. Also, they’re not exactly inexpensive. Now, I would name the exact sizes of what those database instances might look like, but this is AWS; they’re probably going to release at least five different instance families and sizes between the time I finish recording this and when it gets published at the end of the week. So instead, there is an alternative here, and it doesn’t take much from an engineering or design perspective when you’re building out one of these silly toy apps that will never have to scale. What is that fix, you might wonder? Terrific question. Let me tell you in just a minute.


In the late 19th and early 20th centuries, democracy flourished around the world. This was good for most folks, but terrible for the log analytics industry because there was now a severe shortage of princesses to kidnap for ransom to pay for their ridiculous implementations. It doesn’t have to be that way. Consider CHAOSSEARCH. The data lives in your S3 buckets in your AWS accounts, and we know what that costs. You don’t have to deal with running massive piles of infrastructure to be able to query that log data with APIs you’ve come to know and tolerate, and they’re just good people to work with. Reach out to CHAOSSEARCH.io. And my thanks to them for sponsoring this incredibly depressing podcast.


So this is a pattern that, increasingly, modern frameworks are recommending, though a number of them don’t. And I’m not going to name names, because I don’t want to wind up in a slap-and-tickle fight around which frameworks are good versus which frameworks are crappy. You can all make your own decisions around that. But the pattern that makes sense for this is: even when you’re beginning with a toy app, go ahead and define two database endpoints, one for reads and one for writes. Invariably, this is going to solve a whole host of problems with most database technologies. If you take a look at most applications, and yes, I know there are going to be exceptions to this, they tend to bottleneck on reads. If you have just a single database or database cluster, then all of the read traffic gets in the way of being able to write to it. That includes things that don’t actually need to be in line with the rest of what the application is doing. If you can have a read replica that’s used for business analytics, great. Your internal business teams can beat the living crap out of that database replica without damaging anything that’s in the critical path of serving users. And the writes can then go specifically to the primary node, which is generally where the writes have to happen. Now, yes, depending on your database technology, there’s going to be a whole story around multi-primary architectures and how that winds up manifesting. But those tend to be a bit more edge case, and by the time you’re into those sorts of weeds, you know it already.


The point here is that if you look at most applications, they are read-bound. So being able to scale from a single primary to a whole bunch of replicas means that you can have those reads hitting a fleet of systems and, depending upon replication delays, be getting near real-time results from those nodes, without overburdening the single node that can take writes. You wouldn’t think that this would be that big of a deal when it comes to architectural patterns, but I’ve seen so many different environments and so many applications fall victim to this. “Well, it seems like a premature optimization,” you might say, naively, as I once did. “What if we just make that change later, when the time comes?” Well, by the time an application is sophisticated and dealing with enough traffic to the point where you are swamping the capacity of a modern database server, at that point, you don’t have one or two database queries within your application; there are hundreds or thousands. And because there was never any setup that differentiated between a read endpoint and a write endpoint, a lot of queries tend to assume that they can do both, in some cases in the same query. It means that there’s an awful lot of refactoring pain that’s going to come out of this.


“Well, hang on,” you might very reasonably say, “what if you don’t want to spin up twice as many database servers for those crappy toy apps, of which baby seals get more hits?” Great. I’m not suggesting you start spending more money on databases you don’t need. I don’t work for a cloud provider; I am not incentivized to sell you things like that. I would say in that scenario: great, you can have just that single database, because it usually will work. But if you refer to it in the application by two different endpoints that you set as variables at the beginning, one for your read endpoint and one for your write endpoint, it forces good behavior from the beginning, and it saves you, in some cases, months of work down the road trying to refactor this out in ways that are painful, difficult, and, worst of all from this perspective, expensive. Remember, I’m a cloud economist. My entire philosophy is around optimizing cloud spend. This is not something that is going to cost you a lot of money up front, but it is a form of technical debt that you very often don’t realize you’re dealing with. I wish I could say this was just me looking for a random item from my past to talk about, but it’s not. This is something I have seen again, and again, and again, and again, to the point where I can almost quote chapter and verse of the terrifying things people tell me where this winds up being the root cause. People don’t do it to be malicious. They don’t do it out of ignorance. They do it, in most cases, by just assuming that this app is likely never going to get that big. And they’re right. It won’t. Until one day it does, and then I’m here ranting into a microphone, yelling at you about proper database architectures. And given that we’ve already established that my favorite database in the world is Amazon’s Route 53, if I’m lecturing you about database architecture, something has gone very, very, very wrong. And yet, here we are.
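A minimal sketch of what that looks like, assuming PostgreSQL and psycopg2 purely for illustration; the hostnames and database name are hypothetical, and on day one both endpoints point at the same box:

    import os
    import psycopg2  # assuming a PostgreSQL driver; the idea applies to any database

    # Both endpoints default to the same host. When a read replica appears
    # later, only an environment variable changes, not the application code.
    READ_DB_HOST = os.environ.get("READ_DB_HOST", "db.example.internal")
    WRITE_DB_HOST = os.environ.get("WRITE_DB_HOST", "db.example.internal")

    def get_connection(for_write: bool = False):
        host = WRITE_DB_HOST if for_write else READ_DB_HOST
        return psycopg2.connect(host=host, dbname="app")  # hypothetical dbname

    # Queries are forced to declare their intent from the very first commit.
    read_conn = get_connection()
    write_conn = get_connection(for_write=True)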


Thank you for listening to me rant about proper separation of database endpoints. We’ll go back to real world stories next week. But ideally at this point, we have just saved someone from making a terrible, terrible mistake that they will not realize they’re making for months or years. I’m Cloud Economist Corey Quinn. This is the whiteboard confessional. Thank you for listening.


Announcer: This has been a HumblePod production.


Stay humble.

Fri, 06 Mar 2020 03:00:00 -0800
Amazon Transcribe Gets <REDACTED>
AWS Morning Brief for the week of March 2, 2020.
Mon, 02 Mar 2020 03:00:00 -0800
Whiteboard Confessional: How Cluster SSH Almost Got Me Fired

About Corey Quinn

Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.


Links

Transcript

Corey: On this show, I talk an awful lot about architectural patterns that are horrifying. Let’s instead talk for a moment about something that isn’t horrifying. CHAOSSEARCH. Architecturally, they do things right. They provide a log analytics solution that separates out your storage from your compute. The data lives inside of your S3 buckets, and you can access it using APIs you’ve come to know and tolerate, through a series of containers that live next to that S3 storage. Rather than replicating massive clusters that you have to care and feed for yourself, instead, you now get to focus on just storing data, treating it like you normally would other S3 data and not replicating it, storing it on expensive disks in triplicate, and fundamentally not having to deal with the pains of running other log analytics infrastructure. Check them out today at CHAOSSEARCH.io.


So, once upon a time, way back in the mists of antiquity, there was a year called 2006. This is before many folks listening to this podcast were involved in technology. And I admit as well that it is also several decades after other folks listening to this podcast got involved in technology. But that’s not the point of this story. It was my first real job working in anything resembling a production-style environment. I’d dabbled before this, running various environments and doing Windows desktop-style support. I’d played around with small business servers for running Windows-style environments. And then I decided there wasn’t much of a future in technology and spent some time as a technical recruiter, then spent a little bit more time working in a sales role, which I was disturbingly good at, but I was selling tape drives to people. But that’s not the interesting part of the story. What is, is that I somehow managed to luck my way into a job interview at a university, helping to run their Linux and Unix systems.


Cool. Turns out that interviewing is a skill like any other. The technical reviewer was out sick that day, and they really liked both the confidence of my answers as well as my personality. That’s two mistakes right there. One: my personality is exactly what you would expect it to be. And two: hiring the person who sounds the most confident is exactly what you don’t want to do. It also tends to lend credence to people who look exactly like me. So, in the first few months of that role, I had converted some systems over to FreeBSD, which is like Linux, except it’s not Linux. It’s a Unix, and it’s far older, derived from the Berkeley Software Distribution. And managing a bunch of those systems at scale was a challenge. Now understand, in this era, scale meant something radically different than it does today. I had somewhere between 12 and 15 nodes that I had to care about. Some were mail servers. Some were NTP servers, of all things. Utility boxes here and there, the omnipresent web servers that we all dealt with, the Cacti box whose primary job was to get compromised and serve as an attack vector for the rest of the environment, etcetera.


This was a university. Mistakes didn’t necessarily mean the same thing there as they would in revenue-generating engineering activities. I was also young, foolish, and the statute of limitations has almost certainly expired by now. So, running the same command on every box was annoying. This was in the days before configuration management was really a thing. Bcfg2 was out there and incredibly complex. And CFEngine was also out there, which required an awful lot of in-depth, arcane knowledge that I frankly didn’t have. Remember, I bluffed my way into this job and was learning on the fly. So I did a little digging and, lo and behold, I found a tool that solved my problems, called ClusterSSH. And oh, was it a cluster. The way that this worked was that it would spin up different xterm windows on your screen, one for every host in the list you gave it, mirroring your input to all of them at once.


Great. So now I’m logged into all of those boxes at once. If this is making you cringe already, it probably should, because this is not a great architectural pattern. But here we are, we’re telling this story, so you probably know how that worked out. One of the intricacies of FreeBSD is how it handles which services start on boot. For example, with Red Hat-derived systems, before the dark times of systemd, you could run things like chkconfig, that’s C-H-K and then the word config, give it a service, and tell it to turn that service on or off at certain run levels. This is how you would tell it to, for example, start the web server when you boot; otherwise, you reboot the system, the web server does not start, and you wonder why TCP now terminates on the ground. On FreeBSD, this was all controlled via a single file, /etc/rc.conf, which controlled both which services were allowed to start and which services were going to be started automatically on boot, generally via a boolean value provided for each particular service name.


Well, I was trying to do something, probably, I want to say, NTP-related, but don’t quote me on that, where I wanted to enable a certain service to start on all of the systems at once. So I typed a command, specifically echoing the exact string that I wanted, in quotes so it would be quoted appropriately, and then a right angle bracket, to that file, /etc/rc.conf, and then I pressed enter. Now, for those who are unaware of Unix-isms and how things work in the shell, a single right angle bracket means overwrite this file; two angle brackets mean append to the end of this file. I was trying to get the second one, and instead, I wound up getting the first. So suddenly, I had just rewritten all of those files across every server. Great plan, huh? Well, I realized what I’d done as soon as I checked my work to validate that the system had taken the update appropriately. It had not. It had taken something horrifying instead. What happened next? Great question.
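For anyone who hasn’t been bitten by that particular shell-ism yet, here’s the same footgun expressed in Python, using a scratch path rather than the real /etc/rc.conf:

    # The shell's ">" truncates the file; ">>" appends to it.
    with open("/tmp/rc.conf", "w") as f:   # like ">": clobbers everything
        f.write('ntpd_enable="YES"\n')

    with open("/tmp/rc.conf", "a") as f:   # like ">>": adds to the end
        f.write('sshd_enable="YES"\n')

One character is the difference between adding a line and destroying the file.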


But first, in the late 19th and early 20th centuries, democracy flourished around the world. This was good for most folks, but terrible for the log analytics industry because there was now a severe shortage of princesses to kidnap for ransom to pay for their ridiculous implementations. It doesn’t have to be that way. Consider CHAOSSEARCH. The data lives in your S3 buckets in your AWS accounts, and we know what that costs. You don’t have to deal with running massive piles of infrastructure to be able to query that log data with APIs you’ve come to know and tolerate, and they’re just good people to work with. Reach out to CHAOSSEARCH.io. And my thanks to them for sponsoring this incredibly depressing podcast.


So, I’m suddenly staring at a whole bunch of systems that now have a corrupted configuration. Fortunately, this hadn’t taken anything down, at the moment. And it wouldn’t, until one of these systems was restarted. Now, these are Unix boxes, so they don’t tend to get restarted all that often. But it had to be fixed, and immediately, because one, power outages always happen when you least expect them to, and two, leaving a landmine like that for someone else is what we call a career-limiting move in almost every shop, even a university, which is not typically known as a place that’s easy to get fired from. But I could have managed it if I’d left that lying around. So the trick that I found to fixing all of this was logging into every one of those boxes by hand, taking a look to see what services were currently running on each, and then reconstructing what that file should have looked like, which was just an absolute treasure and a joy.


“Now, well, hang on a second. Why didn’t I restore from the backups that were being taken of these systems?” What part of “first Unix admin job” are you not hearing? Backups were a thing that was on my list to get to, eventually. You get really interested in backups right after you really needed to have backups that were working. Also, it turns out backups are super easy. It’s restores that are difficult, and if you can’t restore, you don’t really have a backup. So at the end of going through all of those nodes one by one, over the course of about four hours, I’d managed to successfully reconstruct each of their files. Then what I wound up doing was very carefully restarting each one in sequence during a maintenance window later that afternoon and validating, once I got back in, that it continued to do the things that it had been doing. I would compare what was currently running as a process versus what had been running before I restarted it. Suddenly, I was very diligent about taking backups and keeping an eye on what exactly was running on a particular box. And by the time I got through that rotation, a) I was a lot more careful, and b) everything had been restored, and there was no customer-facing impact.


Now, all of that’s a very long story, but what does it have to do with the Whiteboard Confessional? What was the architectural problem here? The problem, fundamentally, was that I was managing a fleet, even a small one, of systems effectively by hand. And this sort of mistake is incredibly common when you run the wrong command on the wrong box. There was no programmatic element to it; there was no rollback strategy at all. And there are a lot of different directions this could have gone. For instance, I could have echoed that command first, just from a safety perspective, and validated what it did. I could have backed up the files before making a change to them. I could have tested this on a single machine instead of the entire production fleet. But most relevantly to the architectural discussion here, I could have not used freakin’ ClusterSSH. The problem, of course, is that instead of having a declarative state where you define what your system should look like, you’re saying “run this arbitrary command,” through what’s known as an imperative style of configuration management. This pattern continues to exist today across a wide variety of different systems and environments. If you take a look at what Ansible does under the hood, this is functionally what it does, what any config management system does: it runs a series of commands and drops files in place to make sure a system looks a certain way.


If you’re just telling it to go ahead and run a particular command, like “create a user,” every time that command runs, it’s going to create a new user, and you wind up with a whole bunch of users that don’t belong there and don’t need to exist. Thousands upon thousands of users on a system, one for every time the configuration management system has run. That’s how you open bank accounts at Wells Fargo, not how you intelligently manage systems at significant scale. So, making sure that your systems that are doing configuration management understand the concept of idempotence is absolutely critical. The idea being that I should be able to run the same thing multiple times and not have it destroy, duplicate, or go around in circles in any meaningful way. That is the big lesson of configuration management. And today, systems that AWS offers, like AWS Systems Manager Session Manager, can have this same problem. The same with their EC2 Instance Connect. You can run a whole bunch of scripts and one-liners on a variety of nodes, but you’ve got to make sure that you test those things. You’ve got to make sure that there’s a rollback. You have to test on a subset of things, or you’ll find yourself recording embarrassing podcasts like this one, years later, once the statute of limitations has expired. No one is born knowing this, and none of these things are intuitively obvious, until the second time. Remember, when you don’t get what you want, you get experience instead, and experience builds character.
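Here’s a sketch of what idempotence looks like for that “create a user” example, in Python; the username is hypothetical, and the useradd call assumes a Linux box you have root on:

    import pwd
    import subprocess

    def ensure_user(username: str) -> None:
        """Idempotent: safe to run any number of times."""
        try:
            pwd.getpwnam(username)  # user already exists; nothing to do
            return
        except KeyError:
            pass
        subprocess.run(["useradd", username], check=True)

    ensure_user("deploy")
    ensure_user("deploy")  # second run is a no-op, not a second account

The check-before-act structure is what separates “ensure this user exists” from “run useradd again and hope.”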


I am Cloud Economist Corey Quinn, and I am a character. Thank you for listening to this episode of the AWS Morning Brief Whiteboard Confessional. Please leave a five-star review on iTunes if you’ve enjoyed it. If you didn’t, please leave a five-star review on iTunes via a script that continues to write a five-star review on iTunes every time you run it.


Announcer: This has been a HumblePod production.


Stay humble.


Fri, 28 Feb 2020 03:00:00 -0800
RSA Thinks AWS Firewall Manager is a Job Title
AWS Morning Brief for the week of February 24, 2020.
Mon, 24 Feb 2020 03:00:00 -0800
Whiteboard Confessional: Route 53 DB

About Corey Quinn

Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.


Transcript

Corey: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semipolite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best name to call your staging environment is “theory”. Because invariably, whatever you’ve built works in theory, but not in production. Let’s get to it.


But first… On this show, I talk an awful lot about architectural patterns that are horrifying. Let's instead talk for a moment about something that isn't horrifying: CHAOSSEARCH. Architecturally, they do things right. They provide a log analytics solution that separates out your storage from your compute. The data lives inside of your S3 buckets, and you can access it using APIs you've come to know and tolerate, through a series of containers that live next to that S3 storage. Rather than replicating massive clusters that you have to care and feed for yourself, instead, you now get to focus on just storing data, treating it like you normally would other S3 data and not replicating it, storing it on expensive disks in triplicate, and fundamentally not having to deal with the pains of running other log analytics infrastructure. Check them out today at CHAOSSEARCH.io.


I frequently joke on Twitter about my favorite database being Route 53, which is AWS’s managed database service. It’s a fun joke, to the point where I’ve become Route 53’s de facto technical evangelist. But where did this whole joke come from? It turns out that this started life as an unfortunate architecture that was taken in a terrible direction. Let’s go back in time, at this point about 13 years from the time of this recording, in the year of our Lord 2020. We had a data center that was running a whole bunch of instances. In fact, we had a few data centers, or datas center, depending upon how you choose to pluralize; that’s not the point of this ridiculous story. Instead, what we’re going to talk about is what was inside these data centers. In this case: servers.


I know, serverless fans, clutch your pearls, because that was a thing that people had many, many, many years ago, also known as roughly 2007. And on those servers there was this new technology that was running and was really changing our perspective on how we dealt with systems. I am, of course, referring to the amazing, transformative revelation known as virtualization. This solved the problem of computers being bored and not being able to process things in a parallelized fashion, because you didn’t want all of your applications running on all of your systems, by building artificial boundaries between different application containers, for lack of a better term.


Now, in those days, these weren’t applications. These were full-on virtualized operating systems, so you had servers running inside of servers, and this was very early days. Cloud wasn’t really a thing. It was something that was on the horizon, if you’ll pardon the pun. So, this led to an interesting question of, “All right. I wound up connecting to one of my virtual machines, and there’s no good way for me to tell which physical server that virtual machine is running on.” How could we solve for this? Now, back in those days, the hypervisor technology we used was Xen, that’s X-E-N. It’s incidentally the same virtualization technology that AWS started out with for many years before releasing their Nitro hypervisor, which is KVM-derived, a couple of years ago. Again, not the point of this particular story. And one of the interesting pieces about how this worked was that Xen didn’t really expose anything, at least in those days, that you could use to query the physical host a guest was running on.


So, how would we wind up doing this? Now, at very small scale, where you have two or three servers sitting somewhere, it’s pretty easy. You log in and you can check. At significant scale, that starts to get a little bit more concerning. How do you figure out which physical host a virtual instance is running on? Well, there are a bunch of schools of thought you can approach this from. But what you’re trying to build is known, technically, as a configuration management database, or CMDB. This is, of course, radically different from configuration management, such as Puppet, Chef, Ansible, Salt, and other similar tooling. But, again, this is technology, and naming things has never been one of our collective strong suits. So, what do we wind up doing? You can have a database, or an Excel spreadsheet, or something like that, that has all of these things listed, but what happens when you turn an old instance off and spin up a new instance on a different physical server? These things become rapidly out of date. So, what we did was sort of the worst possible option. It didn’t solve all of these problems, but it at least let us address what the perceived problem was, in a way that is, of course, architecturally terrible, or it wouldn’t have been on this show.


DNS has a whole bunch of interesting capabilities. You can view it, more or less, as the phone book for the internet. It translates names to numbers: fully qualified domain names, in most cases, to IP addresses. But it does more than that. You can query an IP address and wind up getting the PTR, or reverse, record that tells you what the name of a given IP address is, assuming that they match. You can set those to different things, but that’s a different pile of madness that I’m certain we will touch upon a different day. So, what we did is we took advantage of a little-known record type known as the TXT, or text, record. You can put arbitrary strings inside of TXT records and then consume them programmatically in a whole bunch of different ways. One of the ways TXT records get used that isn’t patently ridiculous: domains generally have TXT records that contain their SPF record, which shows which systems are authorized to send mail on their behalf, as an anti-spam measure. So, if something else starts claiming to send email from your domain that isn’t authorized, that gets flagged as spam by many receiving servers.


We misused TXT records, because there is no real limit to how many TXT records you can have, and wound up using them as our configuration management database. So, you could query a given instance, we’ll call it webserver003.production.losangeles.company.com, which was our naming scheme for these things, and it would return a record that was itself a fully qualified domain name: the name of the physical host on top of which it was running. So, yeah, we could then propagate that, as we could any other DNS records, to other places in the environment, and we could run really quick queries, and in turn build command-line tooling that you could feed the name of a virtual machine, and it would return, at relatively quick response times, the name of the physical host it was running on.
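As a rough sketch of what the query side of that could look like, here it is using the third-party dnspython library; the record layout follows the naming scheme described above, and the function name is mine:

    import dns.resolver  # third-party: dnspython

    def physical_host_for(vm_fqdn: str) -> str:
        # Per our scheme, the VM's TXT record held its hypervisor's FQDN.
        answers = dns.resolver.resolve(vm_fqdn, "TXT")
        return answers[0].strings[0].decode()

    print(physical_host_for("webserver003.production.losangeles.company.com"))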


So, we could use that for interesting kinds of validation: for example, making sure we didn’t put all four of our web servers for a service on the same physical host. It was an early attempt at solving zone affinity. Now, there are a whole bunch of different ways that we could have fixed this, but we didn’t. We instead misused DNS to build our own configuration management database, because everything is terrible and it worked. And because it worked, we did it. Now, why is this a terrible idea, and what makes this awful? Great question.


But first, in the late 19th and early 20th centuries, democracy flourished around the world. This was good for most folks, but terrible for the log analytics industry because there was now a severe shortage of princesses to kidnap for ransom to pay for their ridiculous implementations. It doesn't have to be that way. Consider CHAOSSEARCH. The data lives in your S3 buckets in your AWS accounts (and we know what that costs). You don't have to deal with running massive piles of infrastructure to be able to query that log data with APIs you've come to know and tolerate, and they're just good people to work with. Reach out to CHAOSSEARCH.io, and my thanks to them for sponsoring this incredibly depressing podcast.


So, the reason that using DNS as a configuration management database is inherently an awful idea comes down to the fact that, first, there are better solutions available for this across the board. In those days, having an actual product for configuration management databases would have been a good move. Failing that, there were a whole bunch of other technologies that could have been used for this. And since we were already building internal tooling to leverage this, having one additional piece of tooling act as the system of record would have been handy. We weren’t provisioning these things by hand. There were automated systems spinning them up, so having them update a central database would have been great. We had a monitoring system—well, we didn’t have a monitoring system, we had Nagios instead. But even Nagios, when it became aware of systems, could, in turn, have figured out where each one was running and updated a database. When a system went down permanently and we removed it from Nagios, we could have caught that and automatically removed it from a real database. Instead, we wound up using DNS. One other modern approach that could have worked super well, but didn’t really exist in the same sense back then, is the idea of tags, in the AWS sense. Now you can tag AWS instances and other resources with up to 50 tags. You can enable some of them for cost allocation purposes, but you can also build a sort-of-working configuration management database on top of them. Now this is, of course, itself a terrible idea, but not quite as bad as using DNS to achieve the same thing.
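For contrast, here’s a rough sketch of that tag-based version with boto3; the instance ID, tag key, and hypervisor name are all hypothetical:

    import boto3

    ec2 = boto3.client("ec2")

    # Record which physical host an instance lives on as a tag...
    ec2.create_tags(
        Resources=["i-0123456789abcdef0"],  # hypothetical instance ID
        Tags=[{"Key": "physical-host", "Value": "hv-17.example.internal"}],
    )

    # ...and query by that tag later, instead of abusing DNS.
    reservations = ec2.describe_instances(
        Filters=[{"Name": "tag:physical-host", "Values": ["hv-17.example.internal"]}]
    )

Still a terrible CMDB, as noted, but at least the data lives next to the resources it describes.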


The best coda to this story, of course, didn’t take place until after I had already tweeted most of the details I’ve just relayed here. And then I wound up getting a response, and again, this was last year, in 2019, and the DM that I got said, “You know, I read your rant about using DNS as a database, and I thought about it, and realized, first, it couldn’t be that bad of an idea, and secondly, it worked. In fact, I can prove it worked, because until a couple of years ago, we were running exactly what you describe here.” So, this is a pattern that has emerged well beyond the ridiculous things that I built back when I was younger. And I did a little digging, and sure enough, that person worked at the same company where I had built this monstrous thing, all the way back in the 2007 era, which means that for a decade after I left, my monstrosity continued to vex people so badly that they thought it was a good idea.


So, what can we learn from this terrible misadventure? A few things. The morals of the story are several. One, DNS is many things, but probably not a database, unless I’m trying to be humorous with a tired joke on Twitter. Two, there’s almost always a better solution than whatever it is that you have built in your own isolated environment. Talking to other people rapidly gets you to a point where you discover that you’re not creating solutions for brand-new problems. These are existing problems worldwide, and someone else is almost certainly going to point you in a better direction than you will come to on your own. And lastly, one of the most important lessons of all: just because you find something horrible that someone built before your time in an environment, it does not mean that they knew what they were doing. It does not mean that it even approaches the idea of a best practice, and you never know what kind of dangerous moron was your predecessor.


Thank you for joining us on Whiteboard Confessional.


If you have terrifying ideas, please reach out to me on Twitter at @quinnypig, and let me know what I should talk about next time.


Announcer: This has been a HumblePod production. Stay Humble.

Fri, 21 Feb 2020 03:00:00 -0800
EBS Gets Overly Multi-Attached
AWS Morning Brief for the week of February 17, 2020.
Mon, 17 Feb 2020 03:00:00 -0800
Polly Brand Voice Want a Platypus?
AWS Morning Brief for the week of February 10, 2020.
Mon, 10 Feb 2020 03:00:00 -0800
Networking in the Cloud Fundamentals: BGP Revisited with Ivan Pepelnjak

About Corey Quinn

Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.


Transcript
Corey: Hello and welcome to our Networking In The Cloud mini series, sponsored by ThousandEyes. That's right. There may be just one of you, but there are a thousand eyes. On a more serious note, ThousandEyes sponsored their cloud performance benchmarking report for 2019 at the end of last year, talking about what it looks like when you race various cloud providers. They looked at all the big cloud providers and determined: what does performance look like from an end user perspective? What does the user experience look like among and between different cloud providers? To get your copy of this report, you can visit snark.cloud/realclouds. Why real clouds? Well, because they raced AWS, Azure, GCP, IBM Cloud, and Alibaba, all of which are real clouds. They did not include Oracle Cloud because, once again, they are real clouds. Check out your copy of the report at snark.cloud/realclouds.


Welcome to week 12 of the Networking In The Cloud mini series of the AWS Morning Brief, sponsored by ThousandEyes. So one of the early episodes of the Networking In The Cloud mini series had me opining in relatively uninformed, broad brushstrokes about the nature of BGP. Today I am joined by Ivan Pepelnjak, a former CCIE who wrote a fascinating blog post, which I will link to in the show notes, saying, "This is great, but this is what happens when someone who's good at one thing steps completely out of their comfort zone into things they don't fully understand and starts opining confidently, if not authoritatively." Ivan, thank you for taking the time to speak with me.


Ivan: Thanks for having me on. And no, I was way more polite than your summary.


Corey: Absolutely. I believe there's a way to tell a story with the hero's journey that everyone talks about when they're building a narrative arc. Instead, I go for the moron's journey, and I always like to be the moron because, generally, I tend to be. As I walk through the world and get things sometimes right, occasionally wrong, I love being corrected when I stumble blindly into an area I don't know. First, because it gives me an opportunity to learn something new, which is great, but it also gives me that opportunity to be the dumbest person in the room again, which is awesome. So...


Ivan: That's exactly why I blog: to get your opinions.


Corey: Exactly. "You have data, I have opinions, and mine are louder" seems to be the way that discourse works in the modern era. So, from a high level, what did I get wrong about BGP?


Ivan: Well, you got everything right about the mess that we are in and the fragility of the generic internet infrastructure. The only thing you got wrong was that you blamed the tool, not the people using the tool.


Corey: It always feels safer, on some level, to blame technology, because a takeaway of, "Well, the user experience around tool X isn't great, and that's a contributing factor to why things break," seems to be a message that carries slightly better than, "And thus the answer is for everyone to be smarter and stop screwing up." And that may very well be the answer. It's just a bitter pill to swallow sometimes. So I find blaming a tool is easy.


Ivan: Yeah, but it's like blaming knives when people get cut, or blaming the chainsaw when someone who was never properly trained cuts their arm off.


Corey: One of my assertions was that BGP is more or less a hot mess because it was designed for an era when people on the internet could fundamentally trust one another, and that doesn't seem to be the case today. The analogy in my mind, which I don't think I mentioned, was SMTP, the email protocol, for lack of a better term. When that was built, the internet was more or less comprised of researchers, and who in the world would ever abuse a protocol like email? It's not like there was any money involved in the internet. Fast forward to today, and your spam folder is inherently a garbage fire.


Ivan: Yeah, but BGP has a slightly different history. It was redesigned a few times. There were several attempts to get the global routing protocol right. And BGP, the last attempt, already included the tools that allow entities that don't trust each other, like commercial internet service providers, to exchange information and apply policies on inbound and outbound updates. So, for example, I don't want to hear about your customers because I hate you and I don't want to peer with you, or I don't want to tell you about my customer because that customer has a special deal and their traffic can only go through some other transit providers, so I will not tell you about that customer. Those things were already a major requirement when BGP was designed, and it always included the tools to implement the policies that individual commercial entities wanted to have, which, by the way, never happened with SMTP. We have BGP version 4 now, and we are still on SMTP version 0.1 plus enhancements.


Corey: I guess the best analogy I can come up with through my exposure to BGP, because I tend to handle internetworking between various groups about as well as I write code, things where I have some vague awareness that there are things you should be doing here that I will almost certainly not get right, so I back away slowly and leave it to professionals. As a result, every time I really see how BGP works in any hands-on sense, or at a point where it's forced upon my awareness, it's similar to how I become aware of plumbing. I don't think about it. I don't question it. I just expect when I turn the faucet on or flush the toilet that water will do what it's going to do. I don't expect the toilet to explode. So the only time I think about BGP is when there's a peering dispute, or when there's a flap or, on one notable occasion, when I was at a security conference and, as a demo, some folks hijacked the entire ASN for the conference and rerouted it halfway around the world and back, which explained why everything was so laggy and crappy.


Ivan: Yeah. You're absolutely right, but all the incidents you mentioned are not the fault of the tool. They are the fault of the tool not being properly used. And also, let's be honest, it took hundreds of years to get the plumbing to the point where you can just turn on the faucet and clean, drinkable water comes out of it. It's not like that happened in the last year or two, and it very probably wouldn't have happened without public pressure to bring us drinkable water, interest in paying for the drinkable water, and some form of regulation to ensure that, if the water company says the water is drinkable, it actually is drinkable. And we have none of those for BGP, or for the generic, global internet infrastructure, I should say. Now you see where you got me; I started blaming the tool.


Corey: See, isn't it addictive? Because it's easy to blame tools. When you start blaming individuals or people, it suddenly feels like, "Oh dear, now I'm calling people out," sometimes intentionally, sometimes not. And then, oh, do they ever come out of the woodwork.


Ivan: Yeah. If we go back to, for example, the one example you mentioned, where someone was able to hijack the whole conference autonomous system, that is because no one is looking at the updates that are being sent through the internet, for various reasons. A, because the service providers are not motivated to filter the announcements that their customers are sending. And B, in the internet core, you might be in a place where everything is so complex that you just don't want to touch anything. So all you have to do to hijack whoever you wish is find the sloppiest possible tier one provider, find the sloppiest possible tier two provider connected to that tier one provider, because now you know that neither one of them will filter what you're sending out, and then you just start hijacking AWS's DNS servers, for example.


Corey: What is the answer to something like this, other than yelling at those sloppy providers to clean up their act?


Ivan: Well, there are two answers. One is customer pressure, but as long as customers go and buy the cheapest bandwidth possible without considering the quality of the service that the service provider is offering, we're not getting anywhere there. The other thing is regulation. There is a reason we have driving licenses, and there is a reason that truck drivers have to pass a different exam than you and me: because they could do more damage. We don't have anything like that on the global internet. It's totally unregulated, apart from, let's say, a mutually agreed understanding that there are five organizations worldwide who handle the address space and autonomous system numbers. And even there, I am getting messages from a few mailing lists where, every week, someone is yet again describing how crooks managed to hijack unused address space belonging to some legacy entity, just because one of those five organizations that was supposed to do the right thing and take care of proper allocation of address space didn't check the very basics of whether the request they got was legit or not.


Corey: The challenge of authenticating that something comes from who it claims to come from generally feels like an authentication problem, and possibly an encryption story as well, but to my understanding that was only added to BGP after the internet was already a going concern. Is my history mistaken on that?


Ivan: No, you're absolutely right. You have two ways of solving this problem. One is with technology, and the other one is with good processes, because what's stopping us from having a global database of who owns what, and having someone be responsible for that database? Then we could all use the information from that database to build filters. So, for example, if you have AS number one and that database says that you only own one prefix, why would I ever accept more than one prefix from you, my customer? And why would I ever accept a prefix from address space that doesn't belong to you?


But of course, that requires, A, that everyone registers in that database, and no one has ever made that mandatory, and B, that I, as the service provider, actually care about security. Honestly, for a sloppy service provider, it's cheaper not to care about security, because caring about security incurs support costs, it means educating clueless customers, and all of that costs money. It's way simpler to just accept everything, propagate everything, pollute the global internet with toxic waste, claim that it's not your fault but your customer's fault, and then everyone comes to the conclusion that BGP is a hot mess.
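
To make the registry-driven filtering Ivan describes concrete, here's a toy sketch in Python. The ASNs and prefixes are documentation-range placeholders, not real registry data, and a real implementation would pull from an IRR or an RPKI validator rather than a hard-coded dictionary.

```python
# A toy illustration of registry-based BGP announcement filtering: accept an
# announcement only if the prefix falls inside space registered to that ASN.
# ASNs and prefixes below are made-up documentation values, not registry data.
import ipaddress

REGISTRY = {
    64512: ["192.0.2.0/24"],  # hypothetical customer: one /24, nothing else
    64513: ["198.51.100.0/24", "203.0.113.0/24"],
}

def accept_announcement(asn, prefix):
    """Accept only prefixes covered by the space registered to this ASN."""
    announced = ipaddress.ip_network(prefix)
    for allowed in REGISTRY.get(asn, []):
        if announced.subnet_of(ipaddress.ip_network(allowed)):
            return True
    return False

print(accept_announcement(64512, "192.0.2.0/24"))  # True: registered space
print(accept_announcement(64512, "10.0.0.0/8"))    # False: not registered
```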


Corey: When did modern BGP, as it stands today, emerge in its current form?


Ivan: It's my vague memory that it must have been in the early 1990s.


Corey: Which is later than one would expect given that the internet predates that by a significant margin.


Ivan: Yeah. They had one routing protocol when ARPANET was still the core of the internet, because then it was easy: everyone was sending information to ARPANET, and only ARPANET needed to know where everyone was. And then they figured out that, no, this would not work, so they invented a routing protocol, I think it was called EGP, and that thing worked for a little bit longer, until they figured out that, no, this was not going to work either. And then there was the famous coffee shop (or wherever) chat between two engineers, resulting in the famous three napkins that were the original specification of BGP. That got implemented, and that was version one. Then we had version two and version three, and I started playing at being an ISP when they were just migrating from version three to version four, which is what we have today, and that must have been in the early 1990s. But I think that Russ White did a podcast on the history of BGP once, and if I ever find it, I'll send you the link to include in the show notes.


Corey: I would like to thank once again ThousandEyes for making this entire ridiculous rant possible. In addition to their cloud performance benchmark report, ThousandEyes winds up giving companies insight into what's going on on the broader internet: routing issues, provider failures upstream, different companies having different problems. It more or less is a real-time traffic-meets-weather map for the internet. This helps companies who use them get a better perspective on the current end user experience and begin routing around provider failures, yelling at providers, et cetera, ideally before those errors become evident to customers. To learn more, visit ThousandEyes.com and tell them Corey sent you. In fact, they may very well say something like, "Wow, you heard about us from Corey and you still looked us up; what a testament to how awesome our product is." My thanks again to ThousandEyes for putting up with my ridiculous nonsense to sponsor this ridiculous podcast.


Tell me a little bit about who you are and why you're well positioned to opine on these ethereal topics that those of us working at small, scrappy, TwitterForPets-style startups don't generally have to think about at this level of networking deep-dive complexity. Who is Ivan Pepelnjak?


Ivan: Well, I started with networking in the mid-1980s, and in those days there was no internet where I was. Then the internet came to central Europe, and I was, at that time, in Yugoslavia, so almost behind the iron curtain. That was the early 1990s, at which time I was already building local area networks, and then I set up the first commercial ISP in my country. At approximately that time, we became a Cisco partner and they pushed me into becoming one of the instructors. I think I got instructor number 12 worldwide, or something along those lines; this was one of the first batches of Cisco certified instructors. Then I started developing courses for Cisco and, actually, BGP was the first course I developed for them. Years later, that turned into some official Cisco training, and I have no idea whether they are still doing that, or what that course would be called today.


Then the big internet bubble happened, and we started offering professional services throughout Europe, designing and building large internet networks for the traditional service providers. Then that bubble burst, and I was smart enough to retire at approximately that time or, as someone said, "took a long coffee break." I got bored, started blogging, and then figured out that there is this tiny little niche for someone to explain to networking engineers how the networking vendors are trying to oversell whatever they are doing, whereas in reality it's usually just a recast of old stuff in new clothes, with some shiny glamor on top so you don't figure out what's going on. And that's what I've been doing for the last almost 15 years.


Corey: You have a similar aspect of your business as I do for mine. Namely, you are independent, you are not backed by any particular vendor and, as a result, you're not sitting here with an agenda of, "Oh, you should do whatever you want, but as long as you're buying Cisco gear to do it," for example. You're a trusted voice in your space.


Ivan: Well, I would hope I am, but yeah, I don't have any vendor behind me. Actually, one of the vendor reps once told me, "We don't care what you say about us as long as you're equally snarky towards everyone."


Corey: Exactly. You can be a jerk as much as you want, just make sure you're a jerk to everyone. That's part of it, for me at least. The other part, given my unique styling, has always been punch up, never down. The reason I own twitterforpets.com is because making fun of an actual startup, where people are pouring blood, sweat, and tears into getting something off the ground, just makes me a jerk.


Ivan: Yeah, likewise. I would never go after a small company, I would try to help them, but the major networking vendors are fair game.


Corey: Absolutely, and this also helps bring this mini series to a close by answering a question I didn't know how I was going to answer until we got here at the end, which is: if people want to learn more about networking in the cloud, now that I've more or less exhausted my knowledge on this, where can they go next? Until today the answer was "I dunno," but now I can say, "Talk to Ivan." You can take them down the path of what modern networking in a cloud era looks like. You can be found at IPspace.net, as well as wherever fine networking snark is sold.


Ivan: Yes, exactly. Thank you.


Corey: Of course. You do webinars, you do a podcast of your own, you've written several books, and your blog is, I'm going to say, obnoxiously prolific.


Ivan: Yeah. I try to publish something on my blog every day. Sometimes it's just a pointer to some other stuff I'm doing; sometimes it's a technical deep dive into a particular topic. I try to publish one rant per week to keep some people amused and other people extremely angry. And yes, I do webinars on particular networking technologies, often with guest speakers, so right now we have more guest webinars than my own. Some of them are on networking in public clouds, others on network automation. So yeah, if you want to know how networking really works in public clouds, either you go for the public cloud's official training, AWS has something, Azure has something, or you can look at my stuff and see what my opinion is on what they're telling you.


Corey: Which I strongly endorse and recommend. Thank you so much for taking the time to correct some of my misunderstandings around what is, admittedly, a highly complex topic.


Ivan: You're most welcome. Thanks for having me.


Corey: Of course. Ivan Pepelnjak, IPspace.net, independent blogger, trusted voice, and gentle corrector when folks get it wrong. This has been the 12-week Networking In The Cloud mini series. Thank you one last time to ThousandEyes for their generous sponsorship of this ridiculous podcast mini series. Thank you again, Ivan, for correcting me when I get it wrong in a variety of fascinating but incredibly confident-sounding ways. I am cloud economist Corey Quinn. If you've enjoyed this podcast, please leave a five-star review on Apple Podcasts. If you've hated this podcast, please leave a five-star review on Apple Podcasts, and tell me exactly what my problem is in the comments.


Announcer: This has been a HumblePod production. Stay humble.



Thu, 06 Feb 2020 03:00:00 -0800
Lies, Damned Lies, and Sponsored Benchmarks
AWS Morning Brief for the week of February 3, 2020.
Mon, 03 Feb 2020 03:00:00 -0800
Networking in the Cloud Fundamentals: Cloud and the Last Mile

About Corey Quinn

Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Transcript

Corey: Hello and welcome to our Networking in the Cloud mini series, sponsored by ThousandEyes. That's right. There may be just one of you, but there are a thousand eyes. On a more serious note, ThousandEyes sponsored their cloud performance benchmarking report for 2019 at the end of last year, talking about what it looks like when you race various cloud providers. They looked at all the big cloud providers and determined: what does performance look like from an end user perspective? What does the user experience look like among and between different cloud providers? To get your copy of this report, you can visit snark.cloud/realclouds. Why real clouds? Well, because they raced AWS, Azure, GCP, IBM Cloud, and Alibaba, all of which are real clouds.



They did not include Oracle Cloud because once again they are real clouds. Check out your copy of the report at snark.cloud/realclouds. It's interesting that that report focuses on the end user experience because as this mini series begins to wind down, we're talking today about the last mile and its impact on what perceived cloud performance looks like. And I will admit that even having given this entire mini series and having a bit of a network engineering background, once upon a time, I still wind up in a fun world of always defaulting to blaming my local crappy ISP.



Now today, my local ISP is amazing. I use Sonic in San Francisco. I get symmetric gigabit. It's the exact opposite of Comcast, which was my last provider until Sonic came to my neighborhood, and it was fun that day, because I looked up and down the block and saw no fewer than six Sonic trucks ripping Comcast out by the short and curlies. Which, let's not kid ourselves, is something we all wish we could do, and I was the happiest boy in town the day I got to do it. Now, the hard part is figuring out that, yes, it is in fact a local ISP problem, because it isn't always. This is also fortuitous because I spent the last month or so fixing my own local internet situation, and today I'd like to tell you a little bit more about that, as well as how and why.



Originally, when I first moved into my roughly, we'll call it, 2,800-square-foot house, spread across three stories, I wound up getting Eeros; that's E-E-R-O. They're a mesh networking setup from a company that was acquired by Amazon after I'd purchased them. These were generation one. The wireless environment in San Francisco is challenging, and in certain parts of my house the reception, as a result, wound up being a steaming bowl of horse crap. The big challenge was figuring out that that's what the problem was. With weird dropouts and handoff issues, it was interesting. The one area that caused immediate improvement was not having these things talk to each other wirelessly, as most full-mesh systems will do, but instead making sure they were cabled up appropriately to a switch at the central patch panel. Now, you have to be careful with switches, because a lot of stuff won't do anything approaching full throughput, because that can get expensive, and a lot of consumer gear is crap.



This was a managed HP ProCurve device, back in the days when HP made networking equipment. That was great. And it's still crap, but it is crap that works at full line rate. So there's that. Next, I figured it was time to take this seriously. So I did some research and talked to people I know who are actually good at things, instead of just sounding on the internet like they're good at things. And I figured the next step was to buy some Ubiquiti Networks-style stuff. Great. We go ahead and trot some of that out. It's enterprise gear. It's full mesh. I, of course, now have a guest WiFi hotspot that you have to pay to use, with "Toss a coin to your WiFi" as an SSID, because of course I do. I have problems. And it's fun and I can play these stupid games, but suddenly every weird internet problem I had in my house started getting better as a result.



And it's astonishing how that changed my perception of various third-party services, none of which, by the way, had anything to do with my actual problem. But there were still some perceptual differences. And this impacts the cloud in a number of subtle ways, and that's what I want to talk about today. So one of the biggest impacts is DNS. And I don't mean that in the sense of big cloud provider DNS; we've already talked about how DNS works in a previous episode. But rather, what resolver you wind up using yourself. One of the things that I did as a part of this upgrade is I rolled out a distribution of Linux called Pi-hole, which sounds incredibly insulting as applied to people, as in, "You know what you should shut? Your Pi-hole." However, it's designed to run on top of a Raspberry Pi and provide a DNS server that creatively blocks ads.



And that's super neat. I liked the idea of just blocking ad servers, but you have to trust whatever you're using for a DNS resolver, because of a few specific use cases that I stumbled over as I went down this path. One, it turns out that handing something a list of every website you'd care to visit is not really the most privacy-conscious thing in the universe. Now, for some reason, the internet collectively decided, you know who we trust with all the things that we look at on the internet and have no worries about giving that information to? That's right. Freaking Google. So 8.8.8.8 was a famously easy-to-remember open resolver, and it works super well. It's quick. It returns everything. The problem is that Google's primary business model is very clearly surveillance, and I don't do anything particularly interesting.



If you look at my DNS history, you're going to find a lot of things that you'd think you could use to blackmail me, but it turns out you actually can't, because I talk about them on podcasts. That's right. I use Route 53 as a database. What of it? And it's all very strange; even without anything to hide, I still feel this sense of pervasive creepiness at the idea that a giant company can look at my browsing history. So blocking things like that is of interest to me. So, okay: instead, I run Pi-hole, which acts as my own resolver but then winds up passing queries on to an upstream provider. I mean, I could run my own, but that has other latency concerns, and DNS latency when you're making requests is super significant, because the entire internet has gone collectively dumb and decided that to display a simple static webpage, you need to make 30 distinct DNS requests in series and wait for them all to come back, among other ridiculous nonsense that is the modern web today.



What makes this extra special is I figured, okay, I'm not going to go with Google, or with Cloudflare, which has other problematic aspects to its business that we need not go into here, but okay, we'll pick Level 3 as another example. And Level 3 was recently sold to CenturyLink, and they are, to be polite, the devil, because they break DNS in horrifying and obscene ways. Namely, whenever something isn't found, it returns a result pointing to its own personalized search engine, which means that anything you have that depends upon NXDOMAIN, or "no resolution available," results suddenly has a result returned. "Well, you can set a cookie in your browser to avoid that behavior, so we don't see what the problem is."



The problem, jackhole, is that I can't set cookies in a DNS system that's querying it from a daemon's perspective. A whole bunch of email blocklist software runs based upon DNS results, and when you suddenly have results returned for everything, they either block nothing or they block everything, to taste. And that winds up being obnoxious and difficult. Now, I'll talk to you a little bit more about how this impacts the cloud in just a second.
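
As an aside, if you want to check whether a resolver plays these NXDOMAIN games, one rough test, sketched here with the dnspython library, is to ask it for a name that shouldn't exist; the resolver IP and test domain below are placeholders.

```python
# A sanity check for the resolver behavior described above: a well-behaved
# resolver returns NXDOMAIN for a nonexistent name rather than synthesizing
# a "helpful" answer. The resolver IP and test name are placeholders.
import dns.resolver

resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = ["192.0.2.53"]  # placeholder: the resolver under test

try:
    answer = resolver.resolve("this-name-should-not-exist.example.com", "A")
    print("Resolver synthesized an answer:", [r.address for r in answer])
except dns.resolver.NXDOMAIN:
    print("Resolver correctly returned NXDOMAIN.")
```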



But first, I would like to thank once again ThousandEyes for making this entire ridiculous rant possible. In addition to their cloud performance benchmark report, ThousandEyes winds up giving companies insight into what's going on on the broader internet: routing issues, provider failures upstream, different companies having different problems. It more or less is a real-time traffic-meets-weather map for the internet. This helps companies who use them get a better perspective on the current end user experience and begin routing around provider failures, yelling at providers, et cetera, ideally before those errors become evident to customers.



To learn more, visit ThousandEyes.com and tell them Corey sent you. In fact, they may very well say something like, "Wow, you heard about us from Corey and you still looked us up; what a testament to how awesome our product is." My thanks again to ThousandEyes for putting up with my ridiculous nonsense to sponsor this ridiculous podcast. Now, I hate running hardware, mostly because I have a knack for breaking things, so when I ran the Pi-hole, sorry, I still can't take that name seriously, I didn't want to run it on an actual Raspberry Pi or anything local that I could accidentally spill water into. If you take a look at my previous history with laptops and liquids, you can understand why I have that perspective. So instead, I decided I wanted to just run it on an EC2 instance, because, "Hey, those credits aren't going to spend themselves."



This is odd, because normally when I'm trying to build something in the cloud, I put it in either Oregon or Virginia. For this thing, I parked it in the more expensive, capacity-constrained region known as us-west-1, because it is close to Northern California, which I personally call home. The latency was far lower, which means every round trip to talk to a DNS server over a VPN (because I'm not an irresponsible moron) takes less time, which leads to a faster perceived user experience. Now, am I saying that automatically having a closer DNS server is going to lead to a faster user experience? Not necessarily. Depending entirely upon the profile of what application you're interacting with, you're going to see mixed results, but generally speaking, a faster-responding DNS server is going to have a positive overall effect on what it's like to use online web applications, whoever you happen to be and whatever you tend to be accessing.
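
A crude way to put numbers on that resolver-proximity argument is to time a few queries against candidate resolvers. Here's a sketch using dnspython, where the resolver IPs are hypothetical examples.

```python
# Rough round-trip timing of DNS queries against candidate resolvers.
# The resolver labels and IPs below are hypothetical examples.
import time
import dns.resolver

CANDIDATES = {"nearby-over-vpn": "10.0.0.2", "public": "8.8.8.8"}

for label, ip in CANDIDATES.items():
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [ip]
    start = time.perf_counter()
    try:
        resolver.resolve("example.com", "A")
        elapsed_ms = (time.perf_counter() - start) * 1000
        print(f"{label}: {elapsed_ms:.1f} ms")
    except Exception as exc:
        print(f"{label}: query failed ({exc})")
```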



So having a resolver that is closer to me was important. Not important enough for me to actually have a Raspberry Pi or something like that here that I have to feed, maintain, and watch get compromised, turn into an attack platform, and join my television-slash-scale, which are currently busy mounting a DDoS attack against the entire internet infrastructure, because everything is terrible. Now, this is an optimization that can lead to a better experience on the end user side, but, and this is where I think a lot of cloud providers get things hilariously wrong, it's usually not latency from a DNS server, or latency between where your customer is and the resource they're speaking to. It's that most applications, be they web apps or otherwise, are designed like complete crap. They have a series of different connections to different hosts that all need to be contacted, resolved correctly, and transfer data for a page to render correctly.



One of the best examples we saw of this was when GDPR first came out and, rather than adjusting for folks who were visiting from Europe, a bunch of websites just dropped all the tracking, all the garbage, and you could see, for a while, sites like the Los Angeles Times loading in 20K and fractions of a second, instead of the 30-megabyte monstrosities we see now. If you take a look at any given webpage that loads today, it's enormous and it's crappy, and having faster DNS resolution times is helpful, sure. But because you're straining raw sewage with your teeth when dealing with the internet, it's worse than just showing you ads. It's tracking behavior; it's malware shoved through ad networks. If you take a look at any given website property that I have with Last Week in AWS, with this podcast, et cetera, yes, I sell ads.



That's normal. What I don't do is track the living hell out of my customers. If you wind up visiting the website from a browser that has significant ad blocking turned on, like I do, you don't get a degraded experience. You don't see a whole host of things getting broken. And it's still not perfect; I still think we're pulling CSS from third-party hosted sites, which is monstrous and obnoxious to me, but I'm not a web designer. I pay other people for that problem. In conclusion, yes, improving your local environment and moving from a crappy ISP to a better ISP is going to lead to better outcomes. But there's so much else that can be done before getting to that point. And I think collectively we miss the boat a lot when we talk about how performance is a cloud provider's ultimate responsibility, yada, yada, yada. No, it's ours.



Stop designing things like crap and start treating customers with a modicum of respect, and I suspect that things will become a lot better for folks who are creating interesting content and sharing useful things. Sure, the rent-seekers are going to have different problems, but I don't really care what happens to them. I don't mean to be unkind, but there we have it. In conclusion, the last mile matters a lot, partially because it is the filter through which everything we do on the internet reaches customers, and partially because so many of these things treat it completely like garbage. That's it for this episode. I am Corey Quinn. This is our Networking in the Cloud mini series of the AWS Morning Brief. For comments, feedback, et cetera, you can find me on Twitter @QuinnyPig. That's Q-U-I-double-N-Y-P-I-G. And make sure to leave a five-star review for this episode, regardless of what you actually thought of it, and tell me exactly what my problem is in the comments. Thanks again to ThousandEyes for their continuing and ongoing support of this ridiculous podcast.


Announcer: This has been a HumblePod production. Stay Humble.

Thu, 30 Jan 2020 03:00:00 -0800
Dedicated T3 Instances Burst My Understanding
AWS Morning Brief for the week of January 27th, 2020.
Mon, 27 Jan 2020 03:00:00 -0800
Networking in the Cloud Fundamentals: Connectivity Issues in EC2

About Corey Quinn

Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.


Transcript

Corey: Welcome to the AWS Morning Brief's miniseries, Networking In the Cloud, sponsored by ThousandEyes. ThousandEyes has released their cloud performance benchmark report for 2020. They effectively race the top five cloud providers: that's AWS, Google Cloud Platform, Microsoft Azure, IBM Cloud, and Alibaba Cloud, notably not including Oracle Cloud, because it is restricted to real clouds, not law firms. It winds up offering an unbiased, third-party, metrics-based perspective on cloud performance as it relates to end user experience. So this comes down to what real users see, not arbitrary benchmarks that can be gamed. It talks about architectural and connectivity differences between those five cloud providers and how that impacts performance. It talks about AWS Global Accelerator in exhausting detail. It talks about the Great Firewall of China and what effect that has on cloud performance in that region, and it talks about why regions like Asia and Latin America experience increased network latency on certain providers. To get your copy of this fascinating and detailed report, visit snark.cloud/realclouds, because again, Oracle's not invited. That's snark.cloud/realclouds, and my thanks to ThousandEyes for their continuing sponsorship of this ridiculous podcast segment.


Now, let's say you go ahead and spin up a pair of EC2 instances and, as would never happen until suddenly it does, you find that those two EC2 instances can't talk to one another. This episode of the AWS Morning Brief's Networking in the Cloud podcast focuses on diagnosing connectivity issues in EC2. It is something that people don't have to care about until suddenly they really, really do. Let's start with our baseline premise: we've spun up an EC2 instance, and a second EC2 instance can't talk to it. How do we go about troubleshooting our way through that process?


The first thing to check, above all else, and this goes back to my grumpy Unix systems administrator days is: are both EC2 instances actually up?


Yes, the console says they're up. It is certainly billing you for both of those instances, I mean, this is the cloud we're talking about, and it even says that the monitoring checks, there are two by default for each instance, are passing. That doesn't necessarily mean as much as you might hope. If you go into the EC2 console, you can validate through the system logs that they booted successfully. You can pull a screenshot out of them. If everything else were working, you could use AWS Systems Manager Session Manager, and if you'll forgive the ridiculous name, that's not a half-bad way to go about getting access to an instance. It spins up a shell in a browser that lets you poke around inside that instance, but that may or may not get you where you need to go. I'm assuming you're trying to connect to one or both of those instances and failing, so validate that you can get into both of those instances independently.
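
If you'd rather script those "are they actually up?" checks than click through the console, something like this boto3 sketch works; the instance IDs are placeholders.

```python
# Scripted versions of the console checks described above, using boto3.
# The instance IDs are placeholders.
import boto3

ec2 = boto3.client("ec2")
instance_ids = ["i-0123456789abcdef0", "i-0fedcba9876543210"]  # placeholders

# Status checks: the system and instance checks should both read "ok".
status = ec2.describe_instance_status(
    InstanceIds=instance_ids, IncludeAllInstances=True
)
for s in status["InstanceStatuses"]:
    print(
        s["InstanceId"],
        s["InstanceState"]["Name"],
        s["SystemStatus"]["Status"],
        s["InstanceStatus"]["Status"],
    )

# The console output shows whether each instance actually booted cleanly.
for iid in instance_ids:
    out = ec2.get_console_output(InstanceId=iid)
    print(iid, (out.get("Output") or "")[-500:])  # tail of the boot log
```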


Something else to check: consider protocols. Very often, you may not have permitted SSH access to these things. Or maybe you can't ping them and you're assuming they're down. Well, an awful lot of networks block certain types of ICMP traffic, echo requests (ICMP type 8), for example. Otherwise, you may very well find that whatever protocol you're attempting to use isn't permitted all the way through. Note, incidentally, just as an aside, that blocking all ICMP traffic is going to cause problems for your network. When packets need to be fragmented, or senders need to discover a smaller path MTU for traffic being sent across the internet, ICMP messages are how they're made aware of that. You'll see increased latency if you block all ICMP traffic, and it's very difficult to diagnose, so please, for the love of God, don't do that.
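
If you absolutely must restrict ICMP, one hedged middle ground is to still allow the "destination unreachable / fragmentation needed" messages (type 3, code 4) that path MTU discovery depends on. A boto3 sketch, with a placeholder security group ID:

```python
# Allow ICMP "destination unreachable / fragmentation needed" (type 3, code 4)
# so path MTU discovery keeps working even if other ICMP is restricted.
# The security group ID is a placeholder.
import boto3

ec2 = boto3.client("ec2")
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # placeholder
    IpPermissions=[{
        "IpProtocol": "icmp",
        "FromPort": 3,  # for ICMP rules, FromPort is the ICMP type...
        "ToPort": 4,    # ...and ToPort is the ICMP code
        "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
    }],
)
```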


Something else to consider as you go down the process of tearing apart what could possibly be going on with these EC2 instances that can't speak to each other: try to connect to them via IP addresses rather than DNS names. I'm not saying the problem is always DNS, but it usually is DNS, and this removes a whole host of different problems that could be manifesting if you just go by IP address. Suddenly resolution failures, timeouts, bad DNS records, et cetera, fall by the wayside. When you have one system trying to talk to another system and you're only using IPs, there's a whole host of problems you no longer have to think about.


Something else to consider in the wonderful world of AWS is network ACLs. The best practice around network ACLs is, of course, don't use them. Have an ACL that permits all traffic, and then do everything else further down the stack. The reason is that no one thinks about network ACLs when diagnosing these problems. So if this is the issue, you're going to spend a lot of time spinning around and trying to figure out what it is that's going on.
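
One quick way to rule network ACLs in or out is to dump the entries on the NACL associated with the subnet in question and eyeball anything that isn't the default allow-everything. A boto3 sketch, with a placeholder subnet ID:

```python
# List the NACL entries for a given subnet so stray denies stand out.
# The subnet ID is a placeholder.
import boto3

ec2 = boto3.client("ec2")
acls = ec2.describe_network_acls(
    Filters=[{"Name": "association.subnet-id",
              "Values": ["subnet-0123456789abcdef0"]}]
)
for acl in acls["NetworkAcls"]:
    for entry in sorted(acl["Entries"], key=lambda e: e["RuleNumber"]):
        direction = "egress" if entry["Egress"] else "ingress"
        print(
            acl["NetworkAclId"], direction, entry["RuleNumber"],
            entry.get("Protocol"), entry.get("CidrBlock"), entry["RuleAction"],
        )
```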


The next, more likely, culprit, and something to consider whenever you're trying to set up different ways of dividing traffic across various segmentation regimes, is security groups. Security groups are fascinating, and the way that they interact with one another is not hugely well understood. Some people treat security groups like they did old-school IP address restrictions, where anything in a given network range, and you can express that in CIDR notation, pronounced the way one would expect, or "C-I-D-R," depending on how you enjoy pronouncing or mispronouncing things, can be allowed through. Sure, but you can also say that members of a particular security group are themselves allowed to speak to this other thing. That, in turn, is extraordinarily useful, but it can also lead to extremely complex interactions, especially when you have multiple security groups layering upon one another.


Assuming that you have multiple security group rules in place, any rule that allows the traffic takes precedence; security group rules are additive and permissive-only. Note as well that there's a security group rule in place by default that allows all outbound traffic. If that's been removed, that could be a terrific reason why an instance is not able to speak to the larger internet.
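
That group-references-group pattern looks something like this in boto3; the group IDs and port are placeholders, and the point is simply that the rule's source is another security group rather than a CIDR range.

```python
# Allow SSH from members of one security group into another, by reference
# rather than by CIDR. Both group IDs are placeholders.
import boto3

ec2 = boto3.client("ec2")
ec2.authorize_security_group_ingress(
    GroupId="sg-0aaaaaaaaaaaaaaaa",  # placeholder: group on the destination
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        # Instead of an IpRanges CIDR entry, reference a source group:
        "UserIdGroupPairs": [{"GroupId": "sg-0bbbbbbbbbbbbbbbb"}],  # placeholder
    }],
)
```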


One thing to consider when talking about the larger internet is what ThousandEyes does other than releasing cloud benchmark performance reports. That's right: they are a monitoring company that gives a global observer's perspective on the current state of the internet. If certain providers are having problems, they're well positioned to figure out who that provider is, where that provider is having the issue, and how that manifests, and then present that in real time to their customers. So if you have widely dispersed users and want to keep a bit ahead of what they're experiencing, this is not a bad way to go about doing it.


ThousandEyes provides a real-time map, more or less, of the internet and its various problems, leading to faster time to resolution. You understand relatively quickly that it's a problem with the internet, not the crappy code that you've just pushed into production, meaning that you can focus your efforts where they can serve the customer better, rather than diagnosing and playing the finger-pointing game of, "Whose problem really is this?" To learn more, visit thousandeyes.com. That's thousandeyes.com, and my thanks to them for their continuing sponsorship of this ridiculous podcast.


Further things to consider when those two EC2 instances are unable to connect to each other: are you using IPv4 or IPv6? IPv6 is increasingly becoming something of a standard across the internet, and when things can't speak IPv6, certain things now manifest as broken. Adoption is largely stalled in North America for some networks, but for others it's becoming increasingly visible and increasingly valuable. Make sure that if you are trying to communicate via IPv6, everything end to end is in fact working.


Something else to consider when you're diagnosing these two instances that can't talk to each other: can you get into each one of them individually? Can the instance that you can get into speak to the broader internet? Can it hit other instances? Effectively, what you're trying to solve for here is fault isolation: drilling down to the point where you can say this one instance has the problem. It's unlikely that the problem is going to apply to both instances at the same time.


Another thing to consider is something more on the host side. Are both of these instances, for example, in the same region, and in the same subnet of the same VPC? If they're supposed to be, are you sure that you've set them up that way?


Remember, private IP addressing can be the same in different VPCs in different regions. So if you think they're in the same region, and they're not, that could be a terrific reason why they aren't actually able to speak. You'd have to communicate across the public internet and use public IP addresses rather than private IP addresses. Understanding what IP addresses each of these instances has is going to be critical for figuring out why they're not speaking correctly.
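
A quick sanity check that both instances actually live where you think they do might look like this boto3 sketch; the instance IDs are placeholders.

```python
# Confirm the VPC, subnet, and addressing of both instances, per the
# paragraph above. Instance IDs are placeholders.
import boto3

ec2 = boto3.client("ec2")
resp = ec2.describe_instances(
    InstanceIds=["i-0123456789abcdef0", "i-0fedcba9876543210"]  # placeholders
)
for reservation in resp["Reservations"]:
    for inst in reservation["Instances"]:
        print(
            inst["InstanceId"],
            inst.get("VpcId"),
            inst.get("SubnetId"),
            inst.get("PrivateIpAddress"),
            inst.get("PublicIpAddress", "no public IP"),
        )
# If the VpcIds differ, matching private IPs mean nothing: they can overlap.
```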


A further thing to consider while you're poking around on the instances themselves: are there host-based firewalls? On Linux, you have iptables. On BSD and its derivatives, you have PF, and there are a bunch of other answers here for handling packets on a local-host basis.


Now, the best practice that I advise is: don't do network controls with host-based firewalls. The reason is that they're very difficult to manage at large scale. It's challenging to remember that that's where the control lives, and as we go down this troubleshooting path, figuring out exactly what's causing the problem gets harder. It doesn't lead to a great outcome, and it adds work. I don't think it's likely that this is going to be your problem, but it's certainly worth considering.


Something else to consider as well: is it possible that there's a bad route, where one of these instances does not have a proper route either to the internet or to the other subnet that you're attempting to speak to? This is largely handled for you by AWS, but by the time you've gotten to this level of the troubleshooting path, that's not necessarily guaranteed. It's something to consider. Is there a route that's missing? Is there a route that's incorrect? Can this thing talk to the broader internet, assuming it's supposed to be able to? If it can't, well, there's your potential problem. There's really a sort of troubleshooting flowchart, mentally, that I go down when I start thinking about problems like this. It's not anything so formalized, but it's one of those, "What sorts of things cause these issues, and what would cause certain failure modes to manifest in a way that aligns with what I'm seeing?"
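
To check for missing or broken routes without clicking around, you can dump the route table associated with the instance's subnet; a boto3 sketch with a placeholder subnet ID. One caveat: subnets without an explicit association use the VPC's main route table, which this particular filter won't return.

```python
# Dump the routes for the route table explicitly associated with a subnet.
# The subnet ID is a placeholder; subnets with no explicit association fall
# back to the VPC's main route table, which this filter does not match.
import boto3

ec2 = boto3.client("ec2")
tables = ec2.describe_route_tables(
    Filters=[{"Name": "association.subnet-id",
              "Values": ["subnet-0123456789abcdef0"]}]
)
for table in tables["RouteTables"]:
    for route in table["Routes"]:
        destination = route.get(
            "DestinationCidrBlock", route.get("DestinationIpv6CidrBlock")
        )
        target = (route.get("GatewayId")
                  or route.get("NatGatewayId")
                  or route.get("InstanceId"))
        print(table["RouteTableId"], destination, target, route.get("State"))
```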


I'm a big believer as well in spinning up another copy of an instance, because hey, it's not that expensive to spin one of those things up for five minutes right next to the one that's broken and see: okay, is it something that is afflicting this instance as well, or is it something that is happening globally?


Depending on your use case, it may not be an appropriate way of solving the problem, but if I can replicate the problem with a very small test case, suddenly it's a lot easier for me to take what I've learned and explain it to someone else when asking for help. When you're asking someone to help solve a problem, saying, "I spun up an instance, and I attempted to connect on port 22, SSH, to another instance in the same subnet, and it doesn't work," is a very small, isolated problem case that you're likely to get good help with, be it through a community support resource or even AWS support. Whereas if you have a complicated environment and you're only ever able to test in that environment, it becomes, "I have this special series of instances spun up from a custom AMI (yes, it's pronounced A-M-I; do not let them call it 'Ah-mee,' they're mispronouncing it if they do), and it's not able to speak to this particular instance on this particular port when the stars align and we're seeing this certain level of traffic." If it has to be part of an existing larger environment, your troubleshooting is fundamentally going to be broken and bizarre in a whole host of different ways. It doesn't make it easier.


So I'm a big believer in getting down to the smallest possible test case, and again, because these are cloud resources, you can spin up effectively everything in a completely different account in not too much time. Sitting here spinning your wheels trying to diagnose network connectivity issues should not be the sort of thing that takes you days on end. You should be able to clear an entire test suite of everything I've just described in an hour or two. It doesn't take that much work to spin up new resources. Back in the days of data centers, it took weeks to get things provisioned, so of course we weren't able to provision a whole new stack; a whole new series of switches and routers and cables and servers and VMs on those servers was just not going to happen.


Instead, we don't have to see that problem now. We can spin up the entire stack quickly, and that's, from my perspective at least, one of the most transformative aspects of the cloud. When you have a question that you're not sure how to answer, you can spin up a test case and see for yourself.


That's all I've got on our connectivity troubleshooting episode of the AWS Morning Brief Networking in the Cloud podcast miniseries segment. My thanks, as always, to ThousandEyes for their generous sponsorship of this ridiculous, ideally entertaining, but nonetheless educational podcast segment. I will talk to you more next week about networking in the cloud, and what that looks like and how things break. Thank you again for listening. If you've enjoyed this podcast, please leave a five star review on Apple Podcasts. If you've hated this podcast, please still leave a five star review on Apple Podcasts and tell me exactly what my problem is.


Announcer: This has been a HumblePod production. Stay humble.

Thu, 23 Jan 2020 03:00:00 -0800
AWS Back-All-The-Way-Up
AWS Morning Brief for the week of January 20th, 2020.
Mon, 20 Jan 2020 03:00:00 -0800
Networking in the Cloud Fundamentals: Data Transfer Pricing

About Corey Quinn

Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Transcript

Corey: Welcome to the AWS Morning Brief, specifically our 12-part mini series, Networking In The Cloud, sponsored by ThousandEyes. ThousandEyes recently released their state of the cloud benchmark performance report. They raced five clouds together and gave a comparative view of the networking strengths, weaknesses, and approaches of those various providers. Take a look at what it means for you. There's actionable advice hidden within, as well as incredibly useful comparative data, so you can start comparing apples to oranges instead of apples to baseballs. Check them out and get your copy today at snark.cloud/realclouds. That's snark.cloud/realclouds because Oracle cloud was not invited to participate.


Now, one thing that they did not bother to talk about in that report is how much all of that data transfer across different providers costs. Today I'd like to talk about that, which is a bit of a lie, because I'm not here to talk about it at all; I'm here to rant like a freaking lunatic, for which I make no apologies whatsoever.


This episode is about data transfer pricing in AWS, because honestly I need to rant about something, and this topic is entirely too near and dear to my heart, given that I spend most of my time fixing AWS bills for various interesting and sophisticated clients.


Let's begin with a simple question. The answer to which is guaranteed to piss you off like almost nothing else. What does it cost to move a gigabyte of data in AWS? Think about that for a second. The correct answer, of course, is that nobody freaking knows. There is no way to get a deterministic answer to that question without asking a giant boatload of other questions.


Let me give you some examples, and before I do, I would like to call out that every number I'm about to mention applies only to us-east-1, because of course different regions in different places have varying costs; every single one of these numbers is different in other places, sometimes, but not always. Why? Because things are awful. I told you I was going to rant. I'm not apologizing for it at this point.


Let's begin simply and talk about what it takes to just shove a gigabyte of data into AWS. Now, in most cases that's free. Inbound bandwidth to AWS is usually free, until it passes through a load balancer or does something else, but we'll get there. What does it cost to move data between two AWS regions? Great. The answer to that is two cents per gigabyte between the primary regions, except there's one case that gets slightly less, and that is moving between us-east-1 and us-east-2. One is in Virginia, two is in Ohio. That is half price, at one cent per gigabyte. My working theory is that it's because even data wants to get the hell out of Ohio.


Let's take it a step further. Let's say you're in an individual region. What does it cost to move data from one AZ to another? The documentation was exquisitely unclear, and I had to do some experiments, spinning up a few instances in otherwise-empty AWS accounts and using dd and netcat to hurl data across various links, then waiting until it showed up on my bill. The answer is that it also costs two cents per gigabyte, the same cost as region to region: it's one cent per gigabyte out of an AZ and one cent per gigabyte into an AZ. That's right, it means you get charged twice. If you move 10 gigabytes, you are charged for 20 gigabytes on that particular metric.


This also has the fun ancillary side effect of meaning that moving data between Virginia and Ohio with that cross-region transfer is cheaper than moving the same data within an existing region. Oh wait, it gets dumber than that. What do load balancer data transfer fees look like? The correct answer is: who the hell knows? On the old classic load balancers, it was 0.8 cents per gigabyte in or out to the internet, and there was also an instance fee, but that's not what we're talking about today. Traffic from an existing load balancer to something inside of an AZ is free, unless it crosses an availability zone, and then we're back into cross-AZ data transfer territory, and anything going from an availability zone to a load balancer costs one cent per gigabyte.
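
To see how silly this gets, here's the arithmetic from the last couple of paragraphs as a worked example, using the us-east-1 rates quoted in this episode; current prices may differ, so treat the numbers as illustrations.

```python
# A worked example of the data transfer rates quoted above for us-east-1.
# Rates are as stated in this episode; check AWS's pricing pages for current ones.
gb_moved = 10

# Cross-AZ within a region: $0.01/GB out of the AZ plus $0.01/GB into the AZ,
# so you're effectively billed on twice the data you actually moved.
cross_az = gb_moved * (0.01 + 0.01)   # $0.20

# Between most primary regions: $0.02/GB.
inter_region = gb_moved * 0.02        # $0.20, same as staying in one region

# The us-east-1 <-> us-east-2 special: $0.01/GB, cheaper than staying home.
virginia_ohio = gb_moved * 0.01       # $0.10

print(f"cross-AZ: ${cross_az:.2f}, inter-region: ${inter_region:.2f}, "
      f"Virginia<->Ohio: ${virginia_ohio:.2f}")
```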


Now, the newer load balancer generations, the ALBs and the NLBs, what do those cost? Nobody freaking knows, because data throughput is just one of several dimensions that go into a load balancer capacity unit, which means that what your data transfer price looks like is going to vary wildly. In this particular case, it's not data transfer itself; there's still a charge as data leaves, but you also have to pay an additional through-the-load-balancer fee, and it's blended into an LCU, so it's not at all obvious at times that that is in fact what you're being billed for.


In another episode of this mini series, we talked about Global Accelerator. Now, there's a site-to-site VPN option, which they've had for a while, but at re:Invent last year they announced an accelerated VPN option that leverages a lot of Global Accelerator technology to let that site-to-site VPN take significant advantage of Global Accelerator. Now, what does that cost? I could not freaking tell you. There are, and I am not exaggerating, five distinct billing line items if you run an accelerated site-to-site VPN, and of course all of them cost you money. That is the actual state of the world. It is incredibly annoying. It is so annoying that I'm going to have to take a break before I blow a blood vessel, to tell you more about ThousandEyes instead.


So, other than the cloud report, what is ThousandEyes? They effectively act as a global observer that watches the entire internet from a whole bunch of different listening posts around that internet, and keeps track in near real time of what's going on, what's being slow, and which providers are having issues, giving information directly to folks on your side so they can understand, adapt to, and mitigate those outages and slowdowns. It helps you immediately get to the point of: is this a networking problem globally, or is it our last crappy code deploy that broke things? If this sounds like something that might be useful for you or your team, I encourage you to check them out at thousandeyes.com. They're a fantastic company with a fantastic product and, best of all, their billing makes sense.


We're back to ranting again. That's right. My problem with AWS data transfer pricing is not just that it's shitty and complex, but also that it's expensive. Pricing largely has not changed since AWS launched, and you're effectively seeing 1998 bandwidth prices as a direct result. In data center land, the way this works is you pay for a link between two places, and however much traffic you put over it, you're charged at the 95th percentile, so you can have bursts and spikes that exceed that commit, but you're paying effectively a flat rate for whatever your throughput looks like over the course of the billing period. It's not the most straightforward thing in the world, but it's a lot less expensive than paying for the same thing in the cloud.


Somehow AWS has managed to successfully convince an entire generation of companies that bandwidth is a rare, precious, expensive commodity. Unless, of course, it's bandwidth directly into AWS from the internet, in which case it's free and you can have as much of it as you want. Data checks in; it doesn't check out. This in turn leads to a lot of weird patterns. For example, if you have a mobile app that reports data to something living in an AWS region, rather than having that data replicated between regions on your dime, you could theoretically have the mobile app report it to two different regions. It doubles your user's bandwidth consumption, potentially on a mobile plan, but it saves you money. How crappy of a dynamic is that?


Now, there are other services that leverage aspects of data transfer pricing in obnoxious ways, and there are ways around this too. PrivateLink, for example, is a link between two different VPCs, in some cases in different accounts. It saves you money on the data egress charge because you don't have to go across the internet. Great, sure. It drops the rate down to one cent per gigabyte in each direction, but at scale you're still paying a significant amount of money for what is, in effect, AWS just moving data around its own internal network.
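
At scale, "one cent per gigabyte" stops sounding cheap in a hurry. A rough sketch, with the internet egress rate as an assumption and PrivateLink's hourly endpoint fees left out entirely:

    # 500 TB/month between two VPCs over PrivateLink vs. the internet.
    gb_per_month = 500_000

    privatelink_per_gb = 0.01 * 2   # $0.01/GB processed, in each direction
    internet_egress_per_gb = 0.09   # a typical first-tier egress rate (assumed)

    print(f"PrivateLink data processing: ${gb_per_month * privatelink_per_gb:,.0f}/month")
    print(f"Internet egress:             ${gb_per_month * internet_egress_per_gb:,.0f}/month")
    # Cheaper than the internet, sure, but that's still $10,000 a month for
    # AWS to shuffle bits around its own network.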


Direct Connect, the service that links an AWS VPC to your on-premises data center, also saves you money on data out that traverses from AWS to your data center. In reverse, though, it costs you more than using the internet, because again, internet ingress is free. If you're doing a large copy from your data center into AWS, you could theoretically use the public internet and put the data directly into S3 or something like that. Why? Because this entire story is a carnival of bullshit. It's awful. Nobody likes it.


Let's talk about CloudFront, AWS's CDN product. It's kind of spendy as far as CDNs go, not horribly so, but what's fun about this is, first off, that what it costs for data in or out of CloudFront varies depending upon what region the access is coming from, and you don't have fine-grained control over any of that. You don't know where your customers are going to come from, and in some cases you can pay three times more when a customer accesses the same application with the same traffic pattern from a different part of the world than if they do it from down the street from your office. What's even more obnoxious is that CloudFront has a competitive advantage over other CDN offerings, because unlike anyone else, AWS can privilege its own environment and make traffic from the origin to the CloudFront distribution free. No one else can do that. So if you're pushing a lot of data through CloudFront that isn't easily cached, that advantage makes it very expensive to even consider using anything else.


This ties into a bigger challenge. Notice as well that AWS has a large pile of, let's call them substandard offerings in some respects, that are managed-service equivalents of things you could build and run yourself: RDS for a bunch of database engines, Amazon Elasticsearch, Neptune, DocumentDB, and a whole host of other managed services. These have presences in multiple availability zones and offer free replication of that data when you use the managed service. So if you can tolerate the obnoxious sharp edges of those managed services and they work well enough for your use case, they have a tremendous leg up over running your own implementation of the open source products those things ape, or over having a third-party vendor manage that service for you. One way or another, unless you're using the AWS version, either you or the vendor managing it for you has to pay that data transfer charge between AZs; in some cases, both of you do. Only Amazon gets to ride data transfer between AZs for free.


This one isn't a data transfer charge directly, although it looks a lot like one: I speak of the managed NAT Gateway. There's a data processing fee, in addition to the instance fee, of four and a half cents per gigabyte. That doesn't sound like a lot until you realize that if you have a web-facing application in a private subnet that needs to talk to the outside world, every gigabyte you put through costs four and a half cents, and that becomes a massive expense. If you run your own NAT instances, sure, you have more overhead, and the hourly charge is about the same as you'd spend on the managed NAT Gateway, but the data processing fee completely vanishes. Remember as well that the data processing fee is in addition to any data transfer fees you'd otherwise be paying. So you go from four and a half cents per gigabyte to nothing.
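
Here's what that looks like in numbers, as a sketch; the hourly charges roughly cancel out between the two options, so only the processing fee is shown:

    # 20 TB/month through NAT: managed NAT Gateway vs. your own NAT instance.
    gb_per_month = 20_000

    nat_gateway_processing = gb_per_month * 0.045  # $0.045/GB data processing
    nat_instance_processing = 0.0                  # the fee simply vanishes

    # Regular data transfer charges apply identically in both cases and are
    # additive, so they're omitted here.
    print(f"Managed NAT Gateway processing: ${nat_gateway_processing:,.2f}/month")   # $900.00
    print(f"NAT instance processing:        ${nat_instance_processing:,.2f}/month")  # $0.00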


Similarly, and very tightly related: if you forget to put a free S3 gateway endpoint in a private subnet that has a managed NAT Gateway in it, every gigabyte of data you transfer into S3 incurs that four-and-a-half-cent-per-gigabyte charge. That's the same as it costs to store that gigabyte in us-east-1 for just shy of two months. That's enormous. You can make it free, but it lies there as a trap for the unwary.
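
The fix is one API call. A minimal boto3 sketch, with placeholder VPC and route table IDs:

    # Add a free S3 gateway endpoint so S3-bound traffic from private
    # subnets stops flowing through the managed NAT Gateway.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    ec2.create_vpc_endpoint(
        VpcEndpointType="Gateway",
        VpcId="vpc-0123456789abcdef0",            # placeholder
        ServiceName="com.amazonaws.us-east-1.s3",
        RouteTableIds=["rtb-0123456789abcdef0"],  # private subnets' route table
    )
    # S3 traffic from those subnets now routes through the endpoint at
    # $0.00/GB instead of $0.045/GB of NAT Gateway data processing.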


Lastly, if you're thinking that all regions are priced equivalently, they're not. You can see four to five times the expense if you go to the region in São Paulo. That's not entirely AWS's fault; to my understanding, it's largely due to telecom monopolies in that part of the world that make bandwidth incredibly expensive for virtually everyone. That's not something I can fault them for, but it's still irritating.


What I can fault them for is that absolutely all of this, everything I have spoken about today, shows up on your bill in a whole mess of places that are incredibly difficult to unpack. Sometimes you're charged for the same data multiple times in different places, and heaven forbid you try to figure out what caused a change: which application workload in which account is suddenly responsible for a whole lot of AZ crosstalk? Without effective tagging, and the foresight to have those tags in place before this moment, the only real answer is looking at VPC flow logs, which you've enabled, right? But those are annoying and confusing and difficult to parse as well.
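
If you do end up spelunking through flow logs, a few lines of Python will at least surface the noisiest talkers; this sketch assumes the default flow log format, and mapping addresses back to subnets and AZs is left to you and your tagging discipline:

    # Tally bytes per (source, destination) pair from VPC Flow Log records.
    from collections import Counter

    def top_talkers(flow_log_lines, n=10):
        totals = Counter()
        for line in flow_log_lines:
            fields = line.split()
            if len(fields) < 14 or fields[0] == "version":  # skip headers/partials
                continue
            srcaddr, dstaddr, nbytes = fields[3], fields[4], fields[9]
            if nbytes != "-":  # NODATA/SKIPDATA records carry no byte count
                totals[(srcaddr, dstaddr)] += int(nbytes)
        return totals.most_common(n)

    with open("flowlogs.txt") as f:  # placeholder export of your flow logs
        for pair, nbytes in top_talkers(f):
            print(pair, f"{nbytes / 1e9:.2f} GB")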


Look, I'm not saying that this is intentional on AWS's part; I'm saying it's quite the opposite. When it comes to data transfer pricing, it is very clear that no one is effectively minding the store. The folks who build out the networks, the folks who handle pricing, and the folks on the service teams that leverage both of those things: apparently none of them are allowed to talk to one another. Data transfer is the white space between AWS services. The next time your AWS account manager asks how they can help you, please be sure to yell at them about the data transfer portion of your bill. It's apparently the only way this will one day get fixed for everyone.


I'm cloud economist Corey Quinn. This is the Networking In The Cloud mini series. If you've enjoyed this podcast, please leave a five-star review and a comment telling me what you liked. If you didn't like this podcast, please leave a five-star review anyway, and a comment telling me what my problem is.


Announcer: This has been a HumblePod production. Stay humble.

Thu, 16 Jan 2020 03:00:00 -0800
Your Database Will Explode in Sixty Seconds
AWS Morning Brief for the week of January 13th, 2020.
Mon, 13 Jan 2020 03:00:00 -0800
Networking in the Cloud Fundamentals: The Cloud in China

About Corey Quinn

Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Transcript
Corey: Welcome back to Networking In The Cloud, a special 12-week mini feature of the AWS Morning Brief sponsored by ThousandEyes. This week's topic: The Cloud in China. But first, let's talk a little bit about ThousandEyes. You can think of ThousandEyes as the Google Maps of the internet. Just like you wouldn't leave San Jose to drive to San Francisco without checking which freeway to take, because local references are always going to resonate best when telling these stories, businesses rely on ThousandEyes to see the end-to-end paths their applications and services take from their servers to their end users, to identify where the slowdowns are, where the pile-ups are, and what's causing those issues. They can use ThousandEyes to figure out what's breaking and, ideally, notify providers before their customers notice. To learn more, visit thousandeyes.com. And my thanks to them for sponsoring this mini series.


Now, when we're talking about China, I want to start by saying that I'm not here to pass judgment. Here in the United States, we're sort of the Oracle Cloud of foreign policy, so Lord knows my hands aren't clean either. Instead, I want to have a factual discussion about what networking in China looks like in the world of cloud in 2020. To start, China is a huge market. The market for cloud services in China this year is expected to reach just over a hundred billion dollars. So there's a lot of money on the table, and there's a lot riding on companies making significant inroads into an extremely lucrative market that is extremely technologically savvy.


Historically, according to multiple Chinese cloud executives interviewed for a variety of articles, China's enterprise IT market is probably somewhere between five and seven years behind most Western markets. That means there's a huge amount of opportunity for companies to make inroads and make an impact on that market before it winds up dominated, as a lot of Western markets have been, by certain large Seattle-based cloud providers, ahem, ahem.


Now, due to Chinese regulations, a cloud offering in China has to be operated by a Chinese company. That's why Microsoft works with a company called 21Vianet, whereas AWS has two partners, Beijing Sinnet and NWCD. Those local partners own and operate the physical infrastructure that the cloud providers build in China and become known as the seller of record, although the US cloud companies do, or at least ostensibly do, retain all the rights to their intellectual property: their trademarks, their copyrights, etc.


That said, if you take a look at any of the large cloud providers' service and region availability tables, there's very clearly a significant lag between when services get released in most regions and when they arrive in the mainland China regions. Some of that, at least according to people speaking off the record, comes down to concern over intellectual property theft. And the current political climate, in which we have basically picked an unprovoked trade war with China, complicates this somewhat heavily, if for no other reason than that companies are extremely skittish about subjecting what they rightly perceive to be their incredibly valuable intellectual property to the risks of operating inside of mainland China. So on the one hand, they don't want to deal with that. On the other, there are over half a billion people in China with smartphones and just shy of 900 million people on the internet in one form or another. So there's an awful lot of money at stake, and companies find themselves rather willing to overlook things they otherwise wouldn't want to bother with. Now again, I'm not here to moralize; I just find the dynamic somewhat fascinating.


Most of that stuff you can find out just from reading news articles and press releases. So let's go a little further into how companies are servicing the Chinese market. Not for nothing, I'm picking on AWS because they're the incumbent in this space, and this is the AWS Morning Brief. Looking at the map on my wall, they have regions in Tokyo, Seoul, Hong Kong, Singapore, and Mumbai. If you squint, that forms a periphery around the outside of mainland China. Here in the real world, if it's at all feasible, companies tend to use those regions scattered around China rather than regions within China, and then provide services to their customers inside China through those geographically local regions, without having to deal with a physical presence inside the country. You can learn a lot about this by looking at ThousandEyes' 2019 Public Cloud Performance Benchmark Report, where they figured out what's going on with IBM, AWS, Azure, Google Cloud, and, this year, Alibaba, which is interesting and we'll get there in a minute, because the report is restricted to real clouds.


Oracle Cloud is not a real cloud and thus was not invited. The report lets you figure out the architectural connectivity differences between these cloud providers, take a look at the AWS Global Accelerator and how it pans out, and see what you can actually expect from real-world networks talking to other real-world networks, so you can decide what makes sense for various use cases. My thanks again to ThousandEyes for sponsoring this podcast. You can get your own copy of the report at snark.cloud/realclouds.


One of those real clouds, as mentioned, is Alibaba. The reason I bring them up is that they currently dominate China's cloud market: Alibaba has something on the order of a 43% market share inside of mainland China. Second behind them, with 17.4%, is Tencent, which is also growing rapidly. AWS is up there as well, given their significant posture in other places. And then there's a whole smattering of small-scale cloud operators still vying for a piece of a very large, very lucrative pie.


Now, if you're talking to any of those providers inside of China, the networking works pretty much like you'd expect it to anywhere else on the planet. The challenge, and why this is worth an entire episode, is what happens when you try to network outside of China to the rest of the internet. Let's talk a little bit about China's Great Firewall. This was started roughly in 1998 in order to enforce Chinese law. News, shopping sites, certain search engines, and pornography are all blocked through a wide variety of methods in accordance with Chinese law, which tends to change and ebb and flow. Not everything is blocked all the time, and keeping up with it is more than a full-time job. Last Week in the Great Firewall's Block List, however, would not be nearly as interesting of a newsletter, so I don't write that one.


They do this through a variety of methods. DNS can be black-holed, to the point where a given domain name doesn't resolve to anything that works. IP addresses can be routed to absolutely nowhere. When they get more sophisticated, they can conduct deep packet inspection on traffic that traverses the firewall and determine whether or not a given request should be serviced. This isn't just a pass-or-fail scenario: the process can also add significant latency. Corporate VPNs, for example, can die randomly, or work in the morning, fail in the afternoon, and come back to working by later that evening.
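
As a crude illustration of how you'd even notice DNS interference, here's a sketch that compares answers from two public resolvers; it requires the dnspython library, and the hostname is a placeholder:

    import dns.resolver

    def resolve_via(hostname, nameserver):
        resolver = dns.resolver.Resolver()
        resolver.nameservers = [nameserver]  # ask only this resolver
        try:
            return sorted(r.address for r in resolver.resolve(hostname, "A"))
        except Exception as exc:  # NXDOMAIN, timeouts, black holes...
            return f"failed: {exc}"

    host = "example.com"
    print("via 8.8.8.8:", resolve_via(host, "8.8.8.8"))
    print("via 1.1.1.1:", resolve_via(host, "1.1.1.1"))
    # Wildly different or empty answers depending on where you ask from is
    # a strong hint that DNS is being tampered with along the way.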


They can attempt man-in-the-middle attacks to defeat TLS, or SSL, depending upon which term you prefer; don't at me. And what's fascinating is that a number of VPN technologies are treated differently. OpenVPN is fascinating in that some key exchanges are not permitted at all, and others are permitted but slowed to less than 56 kilobits per second. IPsec suffers from that dramatic speed reduction as well, so good luck replicating virtually anything over that slow of a link. If you're trying to replicate data from a region outside of China into China, you have to understand that sometimes the replication link is going to break, other times it's going to be incredibly slow, and still other times, if you want to get around all of that, you simply can't use encryption, subjecting all of your valuable corporate data to inspection, not just by the Chinese government, but by anyone who can get a sniff of the traffic between the two endpoints.


So that's generally a nonstarter. As Werner says on his t-shirts at the re:Invent keynote he loves giving: "encrypt everything." I have a spoof t-shirt that says, "encrypt everything unless it's hard." But traversing international borders is one of those times you absolutely want things encrypted. It's the only thing that really makes sense.


So what's the takeaway here? What's actionable if you need to do business inside of mainland China? The honest answer is that this is complex enough, with enough shades of nuance and technical and policy-based challenges, that I would strongly recommend consulting with someone who has experience here. I'm not that person; I generally try to avoid dealing with complex geopolitical issues while troubleshooting networking issues at the same time, and I have nothing to sell you in this context. If you are trying to solve this problem, do reach out to me on Twitter @quinnypig, or email me at corey@lastweekinaws.com, and I'll be thrilled to do a little digging for you if I can't come up with another solution by the time this airs.


So in short, the Chinese networking environment is radically different from what you're going to find anywhere else on the planet. If you're doing business there, you need to do an awful lot of research, you need to go in prepared, and you probably want competent legal counsel who understands the intricacies of doing business cross-border in the Chinese market. In short, if this applies to you: good luck, because here be dragons.


I'm Corey Quinn. This is the AWS Morning Brief's Networking In The Cloud podcast mini series, sponsored exclusively by ThousandEyes. My thanks to them for their generous sponsorship, and my thanks to you for listening. And as always, please feel free to leave an excellent review in Apple Podcasts, whether or not you've actually enjoyed this episode at all. Thanks.


Announcer: This has been a HumblePod production. Stay humble.



Thu, 09 Jan 2020 03:00:00 -0800
Burning Amazon Lex to CD-ROM
AWS Morning Brief for the week of January 6th, 2020.
Mon, 06 Jan 2020 03:00:00 -0800
Listener Mailbag
AWS Morning Brief for the week of December 30th, 2019.
Mon, 30 Dec 2019 03:00:00 -0800
It's a Horrible Lyfebin
AWS Morning Brief for the week of December 23rd, 2019.
Mon, 23 Dec 2019 03:00:00 -0800
Networking in the Cloud Fundamentals: Regions and Availability Zones in AWS

About Corey Quinn

Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Transcript
Corey: Hello, and welcome back to our Networking in the Cloud mini series sponsored by ThousandEyes. That's right: ThousandEyes' State of the Cloud Performance Benchmark Report is now available for your perusal. It provides a lot of the baseline we're drawing this miniseries' information from; it pointed us in a bunch of interesting directions and helps us tell stories that are, for a change, backed by data rather than pure sarcasm. To get your copy, visit snark.cloud/realclouds, because it only covers real cloud providers. Thanks again to ThousandEyes for their ridiculous support of this shockingly informative podcast mini series.



It's a basic fact of cloud that things break all the time. I've been joking for a while that a big competitive advantage that Microsoft brings to this space is that they have 40 years of experience apologizing for software failures, except that's not really a joke. It's true. There's something to be said for the idea of apologizing to both technical and business people about real or perceived failures being its own skillset, and they have a lot more experience than anyone in this space.



There are two schools of thought around how to avoid having to apologize for service or component failures to your customers. The first is to build super expensive but super durable things. You can kind of get away with this in typical data center environments right up until you can't, and then it turns out your SAN just exploded. You're really not diversifying with most SANs; you're just putting all of your eggs in a really expensive basket. And of course, if you're hit with a power or networking outage, nothing can talk to the SAN anyway, and you're back to square one.



The other approach is to come at it with a perspective of building in redundancy to everything and eliminating single points of failure. That's usually the better path in the cloud. You don't ever want to have a single point of failure if you can reasonably avoid it, so going with multiple everythings starts to make sense to a point. Going with a full on multi-cloud story is a whole separate kettle of nonsense we'll get to another time. But you realize at some point you will have single points of failure and you're not going to be able to solve for that. We still only have one planet going around one sun for example. If either of those things explode, well, computers aren't really anyone's concern anymore. However, betting the entire farm on one EC2 instance is generally something you'll want to avoid if at all possible.



In the world of AWS, there aren't data centers in the way you or I would contextualize them. Instead, there are constructs known as availability zones, and those compose to form a different construct called regions. Presumably, other cloud providers have similar constructs over in non-AWS land, but we're focusing on AWS's implementation in this series, again because they have a giant multi-year head start over every other cloud provider, to the point that those other providers compare what they've built and how they operate to AWS. If that upsets you and you work at one of those other cloud providers, well, you should have tried harder. Let's dive into a discussion of data centers, availability zones, and regions today.



Take an empty warehouse and shove it full of server racks. Congratulations. You have built the bare minimum requirement for a data center at its most basic layer. Your primary constraint and why it's a lot harder than it sounds is power, and to a lesser extent, cooling. Computers aren't just crunching numbers, they're also throwing off waste heat. You've got to think an awful lot about how to keep that heat out of the data center.



At some point, you can't shove more capacity into that warehouse-style building just because you can't cool it if it's all running at the same time. If your data center's particularly robust, meaning you didn't cheap out on it, you're going to have different power distribution substations that feed the building from different lines that enter the building at different corners. You're going to see similar things with cooling as well, multiply redundant cooling systems.



One of the big challenges, of course, when dealing with this physical infrastructure is validating that what it says on the diagram is what's actually there in the physical environment. That can be a trickier thing to explore than you would hope. Also, if you have a whole bunch of systems sitting in that warehouse and you take a power outage, well, you have to plan for this thing known as inrush current.



Normally, it's steady state. Computers generally draw a known quantity of power. But when you first turn them on, if you've ever dealt with data center servers, the first thing they do is they power up everything to self-test. They sound like a jet fighter taking off as all the fans spin up. If you're not careful, and all these things turn on at once, you'll see a giant power spike that winds up causing issues, blowing breakers, maxing out consumption, so having a staggered start becomes a concern as well. Having spent too much time in data centers, I am painfully familiar with this problem of how you safely and sanely recover from site-wide events, but that's a bit out of scope, thankfully, because in the cloud, this is less of a problem.



Let's talk about the internet and getting connectivity to these things; this is the Networking in the Cloud podcast, after all. You're ideally going to have multiple providers running fiber lines to that data center, hoping to avoid fiber's natural predator, the noble backhoe. Ideally, all those fiber lines take different paths, but again, that's a hard thing to prove, so doing your homework is important. But here's something folks don't always consider: if you have hundred-gigabit ethernet links to each server, which is not cheap but doable, and 20 servers in a rack, each rack theoretically needs to be able to speak at least two terabits per second to every other rack at all times, and most networks can't do that. They wind up having bottlenecking issues.
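
The bottleneck math is worth seeing once. A sketch with illustrative numbers:

    # Twenty servers with 100 Gbps NICs want 2 Tbps of any-to-any bandwidth;
    # the rack's uplinks rarely provide anything close to it.
    servers_per_rack = 20
    nic_gbps = 100
    uplink_gbps = 400  # e.g., 4x100G uplinks out of the top-of-rack switch

    demand = servers_per_rack * nic_gbps     # 2,000 Gbps of potential demand
    oversubscription = demand / uplink_gbps  # 5:1 in this example
    print(f"{demand} Gbps of demand over {uplink_gbps} Gbps of uplink "
          f"= {oversubscription:.0f}:1 oversubscription")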



As a result, when you have high-traffic applications speaking between systems, you need to make sure they're aware of something known as rack affinity. In other words: where are the bottlenecks between these systems, and how do you minimize them so that crosstalk behaves responsibly? There are a lot of dragons in here, but let's hand-wave past all of it, because we're talking about cloud here. The point is that there's an awful lot of nuance to running data centers, and AWS and the other large cloud providers do a better job of it than you do. That's not me insulting your data center staff; that's just a fact. They have the scale, the staff, and the operational expertise that very few other companies are going to be able to touch.



Sure, if you're Facebook, you probably have some expertise in this as well, and a lot of this won't apply to you, but you know that already. If you're wondering whether what I'm talking about here applies to your environment, unless you know for a fact it doesn't, it does. Assume that. This impacts your approach in the cloud, the networking, durability, and the concept of blast radius, and the forms that AWS gives us that wrap these concepts for us are twofold, and I want to cover them today: regions and availability zones.



An availability zone, or A-Zee, or A-Zed if you're not in the United States, is effectively a set of data centers, and yes, that's plural. It's not just different racks with different power buses in the same room. AWS tries to guarantee that there is no shared power, network, or control plane between availability zones, but you can still expect some issues to impact an entire availability zone. Ergo, if you're building something important, you're going to want it in multiple availability zones.



To that end, here's a fun fact that trips up nearly everyone the first time they see it. If you have an AWS account, you might see that there's an outage in a particular availability zone, us-west-2a. Meanwhile, in my account, I see an outage in us-west-2c. Who's right in that scenario? Well, we both are because availability zone names aren't consistent between AWS accounts.



Relatively recently, about a year ago as of this recording, they announced zone IDs that are consistent between accounts, but people still don't talk in those terms. They still use the old style: region us-west-2 followed by a letter for the availability zone. You have to disambiguate those back to zone IDs with an extra step. That also doesn't solve the whole problem for you, because even with completely separate control planes, issues can still ripple across availability zones indirectly. If your application runs in two availability zones and one of them drops off the internet for a while, you're suddenly seeing twice the load in the availability zone that's still working. You're also probably not the only customer that planned for this and built out in multiple availability zones, so other folks are going to see the exact same behavior.
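
Translating your account's AZ letters into those consistent zone IDs is a one-liner with boto3; the example output is illustrative, since the letter-to-ID mapping varies per account:

    # Map this account's AZ letters to the zone IDs that are consistent
    # across accounts, so two people can argue about the same AZ.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-west-2")
    for az in ec2.describe_availability_zones()["AvailabilityZones"]:
        print(az["ZoneName"], "->", az["ZoneId"])
    # Example output (your letters will likely differ; the zone IDs won't):
    #   us-west-2a -> usw2-az2
    #   us-west-2b -> usw2-az1
    #   us-west-2c -> usw2-az3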



As a result, failures will cascade and manifest as slow performance in the "good" availability zones, and that's super hard to plan for. It's also super hard to detect, which brings us back to our sponsor, ThousandEyes. ThousandEyes provides a global observer perspective on what's going on internet-wide across a bunch of different providers. When one of these incidents hits, it helps answer the question: is it my code, is it the last deployment we did, or is there something global causing this problem?



ThousandEyes provides that global observer perspective that helps you figure out immediately whether it was your code or something infrastructure-based. If it's not your code and it is infrastructure, suddenly you can stop staring at everything you just shipped to production and instead mitigate according to an established DR plan. Thanks again to ThousandEyes for sponsoring this. To learn more, visit thousandeyes.com, and tell them Corey sent you. They seem to like me for some reason that we really can't tell.



You finally build something that's in multiple availability zones, but as mentioned, that cascade effect can be a challenge, so this is where we get to the idea of multiple regions. A region is two or more availability zones, usually three or more, though there are some legacy stories behind that, and regions are separated from one another by very large distances. In the United States, for example, there's one in Oregon, one in Ohio, one in Virginia, and one, sort of, in Northern California. The challenge, of course, is building applications that span multiple regions, and there are a couple of issues with this. If you're trying to tilt at the windmill of multi-cloud, I would strongly encourage you to start by going multi-region in a single provider first.



This removes a lot of the finicky bits of multi-cloud, like services working differently, and if you pick the right regions, you'll have one-to-one parity between all of the different services, which is awesome. Once you get multiple regions online, then you start to see a lot of the challenges with this approach: different things experience latency very differently.



There's also data transfer cost to consider. Whether data is traversing between regions or between availability zones, it incurs an additional cost, so anything with a high replication factor is going to be of some concern; we'll talk specifically about data transfer costs in a future episode. Additionally, if you're in a single provider and going multi-region, they have dedicated links between their regions that usually provide better performance and faster speeds than you'll see traversing the general internet, but see the previous episode on Global Accelerator for some of the caveats there.



One thing to also consider is that because AWS's control plane is regional and does not extend across regions, there are two consequences. The first is that we have never yet seen a networking event that traverses more than one region. The counterpoint is that not all services are available in all regions, so make sure you select appropriate regions based upon the region service availability table.



Further, you're also going to want to make sure the pricing aligns. The region in Northern California, for example, doesn't have as many availability zones as the rest, and everything in that region tends to cost more, so pay attention to that. There were also two other regions, or region-like things, announced recently at AWS re:Invent.



The first is the Local Zone, which is a different type of availability zone. They only have one so far. It is "generally available in preview," which means that words no longer have the same meaning anymore, and it's an extension into Los Angeles of the region based in Oregon, us-west-2. This enables companies and other organizations in Los Angeles to have lower-latency access to AWS resources, for workloads where tens of milliseconds or less matter.



It's fascinating, but it doesn't have the durability you're going to see in a fully baked region, so use it if you have to; if you can avoid it, you're potentially saving yourself some ops burden down the road. They've also taken Outposts, which are fundamentally racks full of AWS equipment that you can rent and put in your facility, and done partnerships with cell companies. In the United States, they've started with Verizon, exploring 5G, and they're calling this AWS Wavelength. It's relevant if and only if you're looking at building 5G-type applications in partnership with Verizon. Most folks aren't, so it's not going to be super relevant, but it is a type of global infrastructure to pay attention to.



Fundamentally, understanding the differences between regions and availability zones in AWS, or their equivalents in other providers, is critical for planning your DR tests. It's super unfortunate to test your DR plan, find that everything works, and then use it during an actual outage and discover everything's slow and provisioning takes forever, because everyone else has the same plan that you do. Take some time, make sure you understand these region and availability zone concepts when you're building out your infrastructure plan, and ideally everything goes a lot more smoothly for you.



That's all I've got to say on this particular topic. If you have questions, please feel free to ask them. On Twitter, I'm QuinnyPig. That's Q-U-I, double N, Y Pig, and I'll do my best to either answer them myself or point you to someone smart who can answer them more authoritatively. If you've enjoyed this podcast, please leave a five-star review in Apple Podcasts. If you've hated this podcast, please leave a five-star review in Apple Podcasts and a funny comment, so I have something to laugh at while crying. I'm cloud economist Corey Quinn, and I'll talk to you more next week.



Announcer: This has been a HumblePod production. Stay humble.


Thu, 19 Dec 2019 03:00:00 -0800
AWS Dep-Ric-ates Treasured Offering
AWS Morning Brief for the week of December 16th, 2019.
Mon, 16 Dec 2019 03:00:00 -0800
reInvent Wrap-up, Part 4
AWS Morning Brief for Friday, December 13th, 2019
Fri, 13 Dec 2019 03:00:00 -0800
Networking in the Cloud Fundamentals, Part 6

About Corey Quinn

Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Transcript
Corey: Knock knock. Who's there? A DDOS attack. A DDOS a... Knock. Knock, knock, knock, knock, knock, knock, knock, knock, knock, knock, knock, knock, knock, knock, knock, knock, knock, knock, knock.


Welcome to what we're calling Networking in the Cloud, episode six: How Things Break in the Cloud, sponsored by ThousandEyes. ThousandEyes recently launched their State of the Cloud Performance Benchmark Report, which effectively lets you compare and contrast performance and other aspects of the five large cloud providers: AWS, Azure, GCP, Alibaba, and IBM Cloud. Oracle Cloud was not invited, because we are talking about real clouds here. You can get your copy of this report at snark.cloud/realclouds. They compare and contrast an awful lot of interesting things. One thing we're not going to compare and contrast, though, because of my own personal beliefs, is the outages of different cloud providers.


Making people in companies (and companies, by the way, are composed of people) feel crappy about their downtime is mean, first off. Secondly, if companies are shamed for outages, it becomes far likelier that they won't disclose having suffered one. And when companies talk about their outages in constructive, blameless ways, there are incredibly valuable lessons we all can learn. So let's dive into this a bit.


If there's one thing that computers do well, better than almost anything else, it's break. And this is, and I'm not being sarcastic when I say this, a significant edge that Microsoft has when it comes to cloud. They have 40-some-odd years of experience in apologizing for software failures. That's not meant to insult Microsoft; it's what computers do: they break. And being able to explain that intelligently to business stakeholders is incredibly important. They're masters at that, and they have a 20-year head start on everyone else in the space. What makes this interesting and useful is that in the cloud, computers break differently than people would expect them to in a non-cloud environment.


Once upon a time, when you were running servers in data centers, if you saw everything suddenly go offline, you had options. You could call the data center directly to see if someone cut the fiber; in case you were unaware, fiber optic cable's sole natural predator in the food chain is the mighty backhoe. So maybe something backhoed out some fiber lines, maybe the power died, maybe the entire thing exploded, burst into flames, and burned to the ground, but you could call people. In the cloud, it doesn't work that way. Here in the cloud, instead, you check Twitter, because it's 3:00 AM and Nagios, the original call of duty, or PagerDuty calls you, because you didn't need that sleep anyway, telling you there's something amiss with your site. So when a large cloud provider takes an outage and you're hanging out on Twitter at two in the morning, you can see DevOps Twitter come to life in the middle of the night as they chatter back and forth.


And incidentally, if that's you, understand a nuance of AWS availability zone naming. When someone says "us-east-1a is having a problem" and someone else says, "No, I just see us-east-1c having a problem," you're probably talking about the same availability zone. Those letters are assigned non-deterministically between accounts. You can pull zone IDs, and those are consistent. By and large, this was originally to avoid problems like everyone picking A, as humans tend to do, or C getting a reputation as the crappy one.


So why would you check Twitter to figure out if your cloud provider is having a massive outage? Well, honestly, because the AWS status page is completely full of lies and gaslights you. It is as green as the healthiest Christmas tree you can imagine, even when things have been exploding for a disturbingly long period of time. If you visit the website stop.lying.cloud, you'll find a Lambda@Edge function I've put there that cuts out some of the cruft, but it's not perfect. The reason behind this, after I gave them a bit too much crap one day and got a phone call that started with "Now you listen here," is that there are humans in the loop. They need to validate that there is in fact a systemic issue at AWS and what that issue might be, then come up with a way to report it that ideally doesn't get people sued, and then manually update the status page. Meanwhile, your site's on fire. So the status page is a trailing function, not a leading function.


Alternately, you could always check ThousandEyes. That's right, this episode is sponsored by ThousandEyes. In addition to the report we mentioned earlier, you can think of them as Google Maps of the internet without the creepy privacy overreach issues. Just like you wouldn't necessarily want to commute during rush hour without checking where traffic is going to be and which route was faster, businesses rely on ThousandEyes to see the end to end paths their applications and services are taking in real time to identify where the slow downs are, where the outages are and what's causing problems. They use ThousandEyes to see what's breaking where and then importantly, ThousandEyes shares that data directly with the offending service providers. Not just to hold them accountable, but also to get them to fix the issue fast. Ideally, before it impacts users. But on this episode, it already has.


So let's say you don't have the good sense to pay for ThousandEyes, or you're not on Twitter for whatever reason, watching people flail around helplessly trying to figure out what's going on. Instead, you're now trying desperately to figure out whether this issue is from the last deploy your team did or if it's a global problem. The first thing people try in the event of an issue is, "Oh crap, what did we just change? Undo it." That's often a knee-jerk response that can make things worse if it's not actually your code that caused the problem. Worse, it can eat up precious time at the beginning of an outage. If you knew it was a single availability zone or an entire AWS region having a problem, you could instead be working to fail over to a different location, rather than wasting valuable incident response time checking Twitter or looking over your last 200 commits.


Part of the problem, and the reason this is the way it is, is that unlike the rusting computers in your data center currently being savaged by raccoons, things in the cloud break differently. You don't have the same diagnostic tools, you don't have the same visibility into what the hardware is doing, and the behaviors themselves are radically different. I have a half dozen tips and tricks for remotely diagnosing whether a data center is having a problem, but they don't work in the cloud, because you're not allowed to break into us-east-1 and install your own hardware. Believe me, I've tried; I still have the scars to prove it. Instead, you have to deal with behaviors that look different.


For example, sometimes you can talk to one set of servers while another is completely non-responsive, yet those two server sets can still talk to one another intermittently, so each of them at times thinks it's the only one there. Or you can talk to both of them, but they can't talk to each other. There are different kinds of failures, and they all look slightly different. Occasionally it looks like slow API responses: "latencies are increasing," which is an awfully nice way to say that suddenly your database doesn't. It often looks like a certain subset of systems being slow or intermittent. Remember as well that availability zones are multiple buildings; it's not just one room with different racks labeled as different AZs, the way we used to do things badly in crappy data center land. It's super hard to take out 20 square blocks and cause multiple simultaneous AZ outages. At least it is with that attitude.


So instead of automatically assuming that "well, it works for me on this other account, so things are fine," dig deeper. Often, issues in one AZ have cascading effects, and you'll see other popular sites on the internet start having problems. Maybe it's not just you. The fact that this is the state of the art for monitoring these issues is a separate problem. The real trouble comes when people haven't changed their thinking to reflect this new cloud reality.


There's no better example of this than DR exercises, or disaster recovery. Now, most ops folks, and I still sort of count myself as one, have tremendous experience with disasters: planning for disasters, recovering from disasters, and in notable cases, causing disasters. The problem is that very often, stories about how to handle disasters don't survive the real world. An easy example: you're running in us-east-1, and your disaster recovery approach is, "Oh, we'll just spin up the site in us-west-2, in Oregon." Great. There are problems with that approach, but let's skip over a few of them and get to the interesting ones.


First, if you're doing this during a test and you spin up a bunch of us-west-2 instances or other services, great, that's probably going to work super well for the purposes of your test. The challenge, of course, is that when you're in the middle of an actual disaster, you are not the only one with that strategy in mind. Suddenly, us-west-2 and other regions, and I don't mean to pick on Oregon in particular, are going to suffer from inrush issues. Very often that means API calls to the cloud's control plane become impacted and latencies increase; there have been scenarios in the past where it took up to an hour for instances to come online after being requested. So if you need an active DR site that's ready to go, you have to pay for those instances and other services to already be up and running.


Secondly, if you're like most shops, you'll test your DR plan every quarter or every year, and you'll find on the first pass that, "Oh, it didn't work; it broke immediately." So you go back, you fix the thing, you try again, and it breaks differently. After enough of this, you finally beat something together that works, and you call it done. You put the binder on the shelf where no one will ever read it again, and everything is fine until the next commit breaks your DR plan again. And that's in the best of times, when there's no actual disaster. Trying to make it work during a real disaster in the middle of the night, when not everyone's firing on all cylinders, becomes a problem.


I also strongly suggest that you don't approach business continuity planning, or BCP, the way that I did; it's why I stopped being invited to those meetings. The problem we ran into was, "Okay, let's pretend for the sake of argument that San Francisco is no longer able to conduct business," to which my immediate response is, "Oh dear heavens, is my family okay?" "Yes, yes, your family's fine, everyone's fine, but magically we can't do any computer work." Okay, I struggle to identify with that, but all right, let's pretend I care that much about my job and not about my family. Cool. I understand everyone's family relationships are different, and for some folks that works.


All right, next step. Simultaneously, us-east-1 is completely unusable. "Okay, so let me get this straight: not only is San Francisco now magically unusable, but roughly a hundred square miles of Northern Virginia is also completely unusable, and at this point I'm not hunkering down in a basement, cowering, waiting for the end of days, because why, exactly?" And the response was, "Just roll with it; it'll be fine. Now, we need a facility outside of the city for you to go to, with all the backups in a different provider, so you can rehydrate everything anew. And at the end of that project, we'll be able to do this whenever we need to." At which point I stared at people for the longest time and said, "You get that we sell ads here, right? And furthermore, let's pretend everything you say is true and us-east-1 is irreparably damaged, and I don't want to spend time with my family in a disaster like that because everyone's fine. Why do I still work here, rather than going to make extortionate money as a consultant for someone else who is not prepared nearly as well as we are?" And then I wasn't invited to those meetings anymore.


One last angle people tend to approach this stuff from is the idea that, well, the service needs an SLA, or service level agreement. Some AWS services have them, some do not, but they don't mean what you think they do. Route 53 famously has a 100% SLA. If they don't meet it, first, they owe you some small portion of your Route 53 bill, which, spoiler, is probably not a large pile of money. Secondly, because they've published that SLA, everything else, including a number of AWS services themselves, almost certainly builds to it. So when it breaks, they owe you some small pile of money, but the outage, because everything breaks, it's what computers do, still impacts your site. No, you can't treat SLA metrics as statements that services will never go down. You own your own availability. You can't outsource that responsibility to third parties, no matter how much you might want to.


It may sound like I'm suggesting that things in the cloud always break and that you shouldn't be in the cloud at all if you can't withstand an outage. I strongly disagree. There are reasons to stay with a cloud provider. First, they're going to diagnose and fix the problem with a far larger staff that is far better equipped to handle these issues than you would be independently, in almost every case.


Secondly, if there's a massive disruption to a public cloud provider, then you're going to be in good company. The headlines are not going to be about your company's outage; they're going to be about the cloud provider. There's some reputational risk that gets mitigated as a direct result.


Finally, if all of that fails and you still go down and everyone makes fun of you for it, well, you can always go for consolation on Twitter.


This has been another episode of Networking in the Cloud. I'm cloud economist, Corey Quinn. Thank you for joining us and we'll talk soon. Thanks again to ThousandEyes for their sponsorship of this ridiculous podcast.


Announcer: This has been a HumblePod production. Stay humble.


Thu, 12 Dec 2019 03:00:00 -0800
reInvent Wrap-up, Part 3
AWS Morning Brief for Wednesday, December 11th, 2019
Wed, 11 Dec 2019 03:00:00 -0800
reInvent Wrap-up, Part 2
AWS Morning Brief for Tuesday, December 10th, 2019.
Tue, 10 Dec 2019 03:00:00 -0800
reInvent Wrap-up, Part 1
AWS Morning Brief for the week of December 9th, 2019.
Mon, 09 Dec 2019 03:00:00 -0800
Wherever You May Rome
AWS Morning Brief for the week of December 2nd, 2019.
Mon, 02 Dec 2019 03:00:00 -0800
Networking in the Cloud Fundamentals, Part 5

About Corey Quinn
Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.


Transcript
Corey: As the world spins faster, it heats up because of friction. Therefore, for the good of humanity, the AWS Global Accelerator must be turned off.


Welcome once again to Networking in the Cloud, a 12-week special on the AWS Morning Brief, sponsored by ThousandEyes. Think of ThousandEyes as the Google Maps of the internet, without the creepy privacy implications. Just like you wouldn't necessarily go from one place to another without checking which route was less congested during rush hour, businesses rely on ThousandEyes to see the end-to-end paths their applications and services take, from their servers to their end users, or between other servers, to identify where the slowdowns are, where the pile-ups live, and what's causing various issues. They use ThousandEyes to see what's breaking where, and then of course depend upon ThousandEyes to share that data directly with the offending providers, to shame them into accountability and get them to fix the issue. Learn more at thousandeyes.com.


So, today we talk about the Global Accelerator, an offering AWS announced at re:Invent last year. What is it? Well, when traffic passes through the internet from your computer en route to a cloud provider, or from your data center to a cloud provider, the provider has choices about how to route that traffic in. Remember, every cloud provider we're going to talk about has a global presence, so they have a number of different options.


Some, such as GCP and Azure, will route that traffic directly into their networks right away, as close to the end user as possible. Others, like AWS and interestingly Alibaba, will have that traffic ride the public internet as long as possible, until it gets to the region that that traffic is aimed at, and then ingested into the provider's network. And, IBM has an interesting hybrid approach between the two of these that doesn't actually matter, because it's IBM Cloud.


Now, Global Accelerator offers a slightly different option here, because by default, traffic bound for AWS rides the public internet until it hits the destination region. That means the traffic is subject to latency from public internet congestion, and that latency is non-deterministic: some packets will get there faster than others as they take different routes, so jitter becomes a concern.


Global Accelerator sort of flips that behavior on its head. Instead of traveling across the entire internet until it smacks into a region, traffic lands on AWS's network far sooner and then rides AWS's backbone to where it needs to go. At the far end, it smacks into one of a number of different endpoint types. At the time of this recording, it supports application load balancers, either internal or external; network load balancers; elastic IPs and whatever you can tie those to; and of course EC2 instances, public or private. We'll mention a caveat about that a little later on.


On the other side, facing the internet, Global Accelerator gives out two IP addresses that are anycast. What that means is that, using BGP, those addresses are generally routed to the supported region closest to the customer. As a result, AWS can make a lot of changes to network architecture in ways that are completely invisible to the end user. It supports, for example, shifting traffic to different regions or endpoints, and it can shape how that traffic manifests on the fly.
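
For a sense of the moving parts, here's a minimal boto3 sketch of putting a Global Accelerator in front of an existing ALB. The ALB ARN is a placeholder, and health checks, error handling, and cleanup are all omitted:

    import boto3

    # The Global Accelerator API itself is served out of us-west-2.
    ga = boto3.client("globalaccelerator", region_name="us-west-2")

    accelerator = ga.create_accelerator(
        Name="demo", IpAddressType="IPV4", Enabled=True
    )["Accelerator"]
    print("Anycast IPs:", accelerator["IpSets"])  # the two static addresses

    listener = ga.create_listener(
        AcceleratorArn=accelerator["AcceleratorArn"],
        Protocol="TCP",
        PortRanges=[{"FromPort": 443, "ToPort": 443}],
    )["Listener"]

    ga.create_endpoint_group(
        ListenerArn=listener["ListenerArn"],
        EndpointGroupRegion="us-west-2",
        EndpointConfigurations=[
            {"EndpointId": "arn:aws:elasticloadbalancing:...", "Weight": 128},
        ],  # placeholder ALB ARN
    )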


Other ways of managing this, such as DNS, run into trouble: clients hold onto cached records for the length of the TTL, so traffic doesn't shift as quickly as you'd like, and once a DNS record is resolved, IP caching means the new answer may never take effect. You see the anycast approach all over the place with, for example, public DNS resolvers: the same IP addresses are what people use globally to talk to well-known DNS resolvers, but strangely, it's always super quick and not traveling across the entire internet. Imagine that.
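
You can see the caching problem for yourself; this sketch, which requires the dnspython library and uses a placeholder hostname, just reads back a record's TTL:

    import dns.resolver

    # The TTL tells you how long clients may cache this answer before
    # they'll even think about asking again.
    answer = dns.resolver.resolve("example.com", "A")
    print("TTL:", answer.rrset.ttl, "seconds")
    for record in answer:
        print("A:", record.address)
    # A 300-second TTL means a DNS-based repoint takes at least five minutes
    # to take effect, and badly behaved resolvers cache far longer. Anycast
    # sidesteps the cache entirely: the IPs never change, only the routing does.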


This is similar in some ways to AWS's CloudFront service. CloudFront is, as mentioned, a CDN with somewhat similar performance characteristics. It generally winds up being the slightly better answer when you're using a protocol like HTTP or HTTPS, which the entire CDN service has been designed around. They have a whole bunch of locations scattered across the globe, and sure, it takes a year and a day to update or deploy a CloudFront distribution, but that's not really the point of this comparison.


Where Global Accelerator shines is when you have non-HTTP traffic, or you need that super responsive failover behavior. You also have a lot more control with Global Accelerator. If, for example, data processing location is super important to you due to regulatory requirements, it's definitely worth highlighting that Global Accelerator grants additional flexibility here. But it's not all sunshine and roses.


There are some performance metrics that shine interesting lights on this. Where do those performance metrics come from, you might wonder? Well, I'm glad you asked. They come from the ThousandEyes State of the Cloud performance benchmark report. As mentioned previously, they wound up doing a whole series of tests across a whole variety of different cloud providers from different networks. Those tests in turn showcase where certain cloud providers shine, where certain cloud providers don't necessarily work as well in some contexts as others do, and, for lack of a better term, let you race the clouds. It's one of the fun things they're able to do because they serve the role of global observer: they have a whole bunch of locations they can monitor from, and they see customer traffic, so they understand what those use cases look like in real life.


Feel free to get your copy of the report today. They race GCP, Azure, AWS, Alibaba, and IBM Cloud. As mentioned on previous episodes, Oracle Cloud was not included because they only race real clouds. Get your copy today at snark.cloud/realclouds, that's snark.cloud/realclouds, and thanks again to ThousandEyes for their continuing support of this ridiculous mini series. Now, what did ThousandEyes learn? Well, this should be blindingly obvious, but in case it's not, the Global Accelerator is not super useful if you and your customers aren't far apart.


An example that came up in the report was that if you're in North America, which by and large has decent internet connectivity, provided you're not somewhere rural, due to a variety of terrible things we'll get to in a future episode, then it's not going to be super useful for you. As far as the internet is concerned, you're generally relatively close to an awful lot of AWS regions in North America. We're talking tens of milliseconds in most cases.


So if your customers are right next to an AWS region, then you're not really going to see a whole lot of benefit from a tool like the AWS Global Accelerator. Now, not everyone lives in San Francisco, it turns out. So, if you have users, customers, et cetera, scattered around the world in far flung places, then it turns out that something like the Global Accelerator can absolutely add some benefits.


The further out into the world, and across the unstable internet, your customers are from your regions, the more meaningfully it can change some of the latency and consistency metrics. Now in a couple of edge cases, and this is contested of course, but notably with one ISP in India, the Global Accelerator performed actively worse than the general internet did in a series of tests. There is some nuance to this, and I understand why people are saying, whoa, hold on there, but the methodology is largely sound.


There are always going to be concerns with various networks and how they peer with other networks. In practice, though, there really is only one solid takeaway from this. And that is: if you're going to be using the AWS Global Accelerator for actual customers, rather than for black-box benchmarking that you really don't want to tell the provider about in advance, then you're going to want to be sure you reach out to AWS to let them know what you're up to before you turn it on. They do have knobs and dials on their side that they can adjust to control things, and of course to figure out what their actual customers are up to. Most cloud providers worth talking to tend to optimize for customer satisfaction, not benchmark satisfaction. That said, as mentioned, it's not all sunshine and roses here.


So when is the AWS Global Accelerator not going to work out super well? Let's talk about some caveats. For one, and this is an edge case, but it is worth highlighting. As I mentioned earlier, the Global Accelerator can be used to talk from the internet to an EC2 instance that lives in a private subnet, provided there's an internet gateway hooked up to that VPC. Now that's a big deal because almost everyone's security policies assume that that is not a situation that's ever going to happen. Well, welcome to reality because that just changed.


If you're deploying Global Accelerator, make very sure that your security policies align with that. It's also worth pointing out that from the time they ran these tests, a month and a half ago or so, to now, there have been significant regional availability announcements for Global Accelerator. It's always a moving target trying to do any kind of review or analysis of an AWS offering, particularly something as broad as a globally distributed networking approach, but it's worth at least noting that they're evolving rapidly. So, understand that today's limitations are possibly not going to apply tomorrow.


Pay attention. Once we learn something for a fact about computers, or anything really, we don't tend to go back and reevaluate it later, and we all fall victim to that. AWS does not hold still, for better or worse. One of my personal pet peeves about this is that pricing is non-deterministic. Now, what do I mean by that? How much is Global Accelerator going to cost you? Well, first there's a fixed fee per hour that it runs. Fine, whatever. Great. We're used to that. In this case, virtually no one cares because it's two and a half cents an hour. That's not what I'm complaining about or particularly concerned by. The problem is that there's now a data transfer premium fee, charged on a per gigabyte basis for traffic transferred over the AWS network.


Now, how is that determined? Glad you asked. You're not going to like the answer. The DT-Premium rate depends on the AWS Region that serves the request and the AWS edge location where the responses are directed. You're only going to be charged that premium fee in the dominant data transfer direction, but note that the fee is on top of the existing data transfer pricing as well. The premium at retail rate goes as low as one and a half cents per gigabyte, but between some regions you can see rates approaching seven cents per gigabyte, eight cents in a couple of them, and ten and a half cents in others, where there are significant costs driving this.
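

To make the estimation problem concrete, here's a toy cost model in Python. The premium rates here are illustrative placeholders, not real quotes; the whole point of this rant is that the real numbers depend on per-route lookups and your own traffic patterns.

    # Back-of-the-envelope Global Accelerator cost sketch. The premium
    # rates are illustrative placeholders; real rates vary by the
    # source/destination pairing and change over time.
    HOURLY_FEE = 0.025          # fixed fee per accelerator-hour
    DT_PREMIUM_PER_GB = {       # hypothetical (source, destination) -> rate
        ("NA", "NA"): 0.015,
        ("NA", "APAC"): 0.105,
    }

    def monthly_estimate(route, gb_dominant_direction, hours=730):
        premium = DT_PREMIUM_PER_GB[route] * gb_dominant_direction
        # Standard AWS data transfer charges still apply on top of this.
        return hours * HOURLY_FEE + premium

    # 5 TB/month to APAC: 18.25 in hourly fees + 525 in premium = 543.25,
    # before any of the normal data transfer charges.
    print(monthly_estimate(("NA", "APAC"), 5000))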


But my problem isn't that it's expensive, and my problem isn't that this pricing is inherently unfair. My concern is that it is effectively impossible to come up with any reasonable estimate of what this is going to cost you until you test it on your own traffic patterns and find out. This is one of the big problems I have as a cloud economist whenever I make fun of the cloud: it's almost impossible to figure things out in advance, pricing-wise. Test it, see it, and if it's that big of a surprise, cry, beg for mercy, and hope you get a refund. Other caveats include that CloudFormation support for Global Accelerator is not super great even at the time of this recording, because why would it be? You're not allowed to talk to each other if you work on different AWS service teams.


And of course, my last real caveat for this, and it's just an annoyance, but as I mentioned at the beginning of this episode, you get some very similar behavior from GCP and Azure for free, simply by using GCP or Azure. AWS is charging you a premium for this, because AWS. In return for that premium, I would expect to see significant functionality delivered as a result. For an awful lot of use cases, I don't think it's quite there yet. For your use case, it might be. So don't take this as a condemnation of the service; take it as you would almost anything else AWS or other providers release: investigate further.


So, the takeaway here fundamentally is that your results are going to vary wildly. And the list of variables isn't lengthy. What regions you're in, what networks your customers are coming from, what your traffic looks like: that's what's going to drive the cost. And the only way you're going to get answers for this is to test it and see how it performs for your use case.


Now, the Global Accelerator is not a panacea, but it very well could help with some specific use cases. It might also cost a king's ransom and, in a few edge cases, make things actively worse, but that's why we test, and that's why we talk to AWS when we're doing things like this. That said, it's worth keeping an eye on the Global Accelerator as it continues to evolve. To learn more about the product, you can type AWS Global Accelerator into your search engine of choice, and then go slowly mad with frustration as it takes forever to return a result because of the slow internet.


This has been another episode of what I'm calling networking in the cloud. Thanks again to ThousandEyes for making it possible. We will not be having a Networking in the Cloud episode next week, because we will be busy with other things in the wake of the disaster that is re:Invent. If you're looking for more content, there are plenty of other places to go, just not here for one week. I'm cloud economist Corey Quinn. Thank you for listening, and I will talk to you in two weeks.


Announcer: This has been a HumblePod production. Stay humble.

Thu, 28 Nov 2019 03:00:00 -0800
Improving Customers by Stuffing Them Into Containers
AWS Morning Brief for the week of November 25th, 2019.
Mon, 25 Nov 2019 03:00:00 -0800
Networking in the Cloud Fundamentals, Part 4

About Corey Quinn

Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Transcript

An IPv6 packet walks into a bar. Nobody talks to it.


Welcome back to what we're calling Networking in the Cloud, a 12 week networking extravaganza sponsored by ThousandEyes. You can think of ThousandEyes as the Google Maps of the internet. Just like you wouldn't dare leave San Jose to drive to San Francisco without checking to see if the 101 or the 280 was faster, and yes, that's a localized traffic story that means nothing to people outside of the Bay Area, businesses rely on ThousandEyes to see the end to end paths their apps and services are taking. This enables companies to figure out where the slowdowns are happening, where the pileups are, and what's causing issues. They use ThousandEyes to see what's breaking where, and importantly they share that data directly with the offending service providers to hold them accountable in a blameless way and get them to fix the issue fast, ideally before it impacts their end users.


Learn more at thousandeyes.com. And my thanks to them for sponsoring this ridiculous podcast mini-series.


This week we're talking about load balancers. They generally do one thing and that's balancing load, but let's back up. Let's say that you, against all odds, you have a website and that website is generally built on a computer. You want to share that website with the world, so you put that computer on the internet. Computers are weak and frail and often fall over invariably at the worst possible time. They're herd animals. They're much more comfortable together. And of course, we've heard of animals. We see some right over there.


So now you have a herd of computers that are working together to serve your website. The problem now of course, is that you have a bunch of computers serving your website. No one is going to want to go to www6023.twitterforpets.com to view your site. They want to have a unified address that just gets to wherever it has to happen. Exposing those implementation details to customers never goes well.


Amusingly, if you go to Deloitte, the giant consultancy's website, the entire thing lives at www2.deloitte.com. But I digress. Nothing says we're having trouble with digital transformation quite so succinctly.


So you have your special computer or series of computers now that live in front of the computers that are serving your website. That's where you wind up pointing twitterforpets.com to, or www.twitterforpets.com towards. Those computers are specialized and they're called load balancers because that's exactly what they do; they balance load, it says so right there on the tin. They pass out incoming web traffic to the servers behind the load balancer so that those servers can handle your website while the load balancer just handles being the front door that traffic shows up through.


This unlocks a world of amazing possibilities. You can now, for example, update your website or patch the servers without taking your website down with a back-in-five-minutes sign on the front of it. You can test new deployments with entirely separate fleets of servers. This is often called a blue-green deploy or a red-black deploy, but that's not the important part of the story. The point is you can start bleeding off traffic to the new fleet and, "Oh my god, turn it off, turn it off, turn it off. We were terribly wrong. The upgrade breaks everything." But you can do that; turn traffic on, turn traffic off to certain versions, and see what happens.


Load balancers are simple in concept, but they're doing increasingly complicated things. For instance, pretend you're a load balancer. You're sitting in front of 200 servers that all do the same thing, because they have the same website and the same application code running on them. How do you determine which one of those receives the next incoming request?


There are a few patterns that are common. The first, and maybe the simplest, is called round robin. You'll also see this referred to as next in loop. Let's say you have four web servers. Your first request goes to server one. Your second request goes to server two, then server three and server four, and the fifth request goes back to server one. It just rotates through the servers in order and passes out requests as they come in.


This can work super well for some use cases, but it does have some challenges. For example, if one of those servers gets stuck or overloaded, piling more traffic onto it is very rarely going to be the right call. A modification of round robin is known as weighted round robin, which works more or less the same way, but it's smarter. Certain servers can get different percentages of the traffic.


Some servers, for example, across a wide variety of fleets, can be larger than others and can consequently handle more load. Other servers are going to have a new version of your software or your website, and you only want to test that on 1% of your traffic to make sure that there's nothing horrifying that breaks things, because you'd fundamentally rather break things for 1% of your users than 100% of your users. Ideally you'd like to break things for 0% of your users, but let's keep this shit semi-real, shall we?
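

A quick sketch of both strategies in Python, with a made-up four-server fleet; the randomized weighting below approximates weighted round robin over many requests rather than implementing a strict rotation:

    import itertools
    import random

    servers = ["web1", "web2", "web3", "web4"]  # hypothetical fleet

    # Plain round robin: hand out requests in order, wrapping around.
    rotation = itertools.cycle(servers)
    for _ in range(5):
        print(next(rotation))  # web1, web2, web3, web4, web1

    # Weighted selection: web4 runs the canary build and gets ~1% of traffic.
    weights = {"web1": 33, "web2": 33, "web3": 33, "web4": 1}

    def pick() -> str:
        names = list(weights)
        return random.choices(names, weights=[weights[n] for n in names])[0]

    print(pick())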


You can also go with the least loaded metric type of approach. Some smarter load balancers can query each backend server or service that they're talking to about its health and get back a metric of some kind. If you wire logic into your application where it says how ready it is to take additional traffic, load balancers can then start making intelligent determinations as to which server to drop traffic onto next.


Probably one of the worst methods you can use to determine how to pass out traffic to load balancers is random, which does exactly what you'd think because randomness isn't. There's invariably going to be clusters and hotspots and the entire reason you have a load balancer is to not have to deal with hot spots; one server's overloaded and screaming while the one next to it is bored, wondering what the point of all of this is.


There are other approaches too that offer more deterministic ways of sending traffic over to specific servers. For example, taking the source IP address that a connection is coming from and hashing that. You can do the same type of thing with specific URLs where the hash of a given URL winds up going to specific backend services.


Why would you necessarily want to do that? Well, in an ideal world, each of those servers is completely stateless and each one can handle your request as well as any others. Here in the real world, things are seldom that clean. You'll find yourself very often with state living inside of your application. So if you have a backend server that handles your first request and then your next request goes to a different backend server, you could be prompted to log in again and that becomes really unpleasant for the end user experience.


The better approach generally is to abstract that login session into something else like ElastiCache or Redis or Memcached or Route 53, but there are a lot of ways to skin that cat that are all off topic. Some sites do indeed use a hashing algorithm to deterministically drive the same connection to the same server. This is known, incidentally, as sticky sessions, the idea being that you want to make sure that you have the same server handling each request from a given client. It's not ideal, but being able to ensure that persistence is important to some workloads, and I'm not going to sit here casting blame at all of them, just some of them. You know who you are.
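

Here's a minimal sketch of that source-IP hashing in Python, again assuming a made-up fleet. Note the caveat in the comments: the naive modulo approach reshuffles most clients whenever the fleet changes size, which is why consistent hashing exists.

    import hashlib

    servers = ["web1", "web2", "web3", "web4"]  # hypothetical fleet

    def pick_server(client_ip: str) -> str:
        # Hash the source IP so a given client always lands on the
        # same backend, and its session state along with it.
        digest = hashlib.sha256(client_ip.encode()).digest()
        return servers[int.from_bytes(digest[:4], "big") % len(servers)]

    # Same client, same server, every time.
    assert pick_server("198.51.100.7") == pick_server("198.51.100.7")

    # Caveat: naive modulo reshuffles most clients if the fleet grows or
    # shrinks; consistent hashing is the usual fix for that.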


And there are a few other approaches too that we're not going to go too far into. You can, for example, use least connections: whichever server currently has the lowest number of active connections gets the traffic. That can cause problems when something has just been turned on and is still spinning up, and suddenly it gets a bunch of traffic dropped on it before it's ready.


And of course the worst of all worlds is fastest to respond, where you send a connection request to all of the servers and the first one to respond winds up winning it. That is a terrific way to wind up incentivizing all your servers to compete against one another. Try that on employees and let me know how that one goes before trying it on computers.


Now, none of those approaches want to drive traffic to servers that are unhealthy, so load balancers perform what are known as health checks. In other words, every 5 seconds or 30 seconds or however often it's configured, you will see a load balancer doing a health check on all of its listening instances. Now, the fun part is that those health checks show up in the logs as the load balancer continually tries to validate that those instances are ready to receive traffic. If it's polling for specific metrics about how ready an instance is, that can be a little heavier. But one of the more annoying parts is that if you look at your server logs for a relatively un-trafficked site, you'll see that the vast majority of your log data ends up being load balancer health checks, which is not just annoying, but also becomes super expensive if you're paying a service to ingest your logs.
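

A stripped-down sketch of that polling loop, assuming hypothetical backends that expose a /healthz endpoint; real load balancers add timeouts, thresholds, and hysteresis before evicting a server:

    import time
    import urllib.request

    # Hypothetical backends exposing a health check endpoint.
    BACKENDS = ["http://10.0.1.10/healthz", "http://10.0.1.11/healthz"]
    INTERVAL = 5  # seconds between checks

    def is_healthy(url: str) -> bool:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                return resp.status == 200
        except OSError:
            return False

    while True:
        in_rotation = [url for url in BACKENDS if is_healthy(url)]
        # Route traffic only to in_rotation. Every single probe above
        # lands in a backend access log, which is where the log noise
        # (and the log ingest bill) comes from.
        time.sleep(INTERVAL)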


This message is sponsored by Splunk. I'm just kidding. It's sponsored by ThousandEyes who does not charge you for log ingest. In fact, they're not charging you at all for last week's state of the cloud performance benchmark report. We've talked about this in recent weeks, but it's still there. It is now public. You can get your own copy. We'll be talking about aspects of it in the coming weeks.


But they took five production tier clouds, AWS, Azure, GCP, Alibaba and IBM Cloud. Oracle Cloud was again not invited because they only tested this with real clouds. To get your free copy of the report, visit snark.cloud/realclouds. That's snark.cloud/realclouds. And my thanks once again to ThousandEyes for sponsoring this ridiculous mini-series podcast.


So that more or less covers how load balancing in a given region tends to work. Let's talk about global load balancing for a bit. Just because individual computers are fragile, individual data centers or individual cloud regions are also fragile in interesting ways. If you try and build a super redundant localized data center, very often the number one cause of outages is the redundancy stuff that you've built. Two different things are each convinced that they're now the one in charge and they wind up effectively yanking the rug out from each other. There's a whole series of failure modes there that are awful.


For things that are sufficiently valuable to the business you don't want to be dependent on any one facility or any one region. So the idea is you want to have something that balances load globally. Now, often you're going to use something like DNS or Anycast to wind up routing to various environments. Usually those environments are themselves offering up load balancers again that then in turn passes it out to individual servers.


The problem, of course, with doing anything on a localized basis that also works globally is that things like DNS or Anycast wind up being subject to lag. It can also be subject to caching, depending on how it works. So you're not going to be able to quickly turn off a malfunctioning region, but you don't generally have to move as quickly for that as you do for a single malfunctioning server. So usually a mix of approaches is the right answer.


Let's talk specifically about what AWS offers in this space because once again, they are the 800 pound gorilla in the cloud space. If this offends you and you'd rather we talk about a different cloud provider, well, that cloud provider is welcome to start gaining market share to the point where they're the big player in the space and then we'll talk about them instead.


AWS does offer a few things at a global level. You can use CloudFront, which is their CDN, and that picks between a number of different origins based upon a variety of factors. Route 53's DNS offering, when not being misused as a database the way I like to, offers interesting load balancing options as well. Global Accelerator can pick healthy endpoints where you can terminate your traffic. But after using all of those, once you hit a localized region, you probably want to use something else, and the three most common options are all Amazon's Elastic Load Balancing offerings.


Now originally there was just one, called Elastic Load Balancer, ELB, that these days is called ELB Classic, which is because AWS has problems with their marketing team when they try and call it ELB Old and Busted. It has some limitations. It only scales so far. It requires pre-warming, namely it needs to have load pass through it before it scales up and can handle traffic efficiently. Otherwise, if you drop a whole bunch of traffic on it when it's not prepared for this and hasn't been pre-warmed, its response is, "Oh, shit, load," and then TCP terminates on the floor. Everyone's having a bad day when that happens.


So AWS looked at this and saw that it was good and then thought, "Okay, how can we A, create better offerings and B, make the billing suck far more?" They came up with two different offerings. One was the ALB, or Application Load Balancer, and the other was the NLB, the Network Load Balancer. Those two things split the world and complicated the living hell out of the billing system, because instead of the ELB Classic's charge model of per hour that you're running a load balancer and per gigabyte of traffic that passes through it, the new ALBs and NLBs charge per hour and per load balancer capacity unit.


There are several dimensions that comprise a load balancer capacity unit: new connections per second, active connections per minute, traffic processed over a period of time, and a few others. The correct answer to what will this cost me to run behind an NLB or an ALB is nobody freaking knows; try it yourself and see.
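

Here's a hedged sketch of that math in Python. The per-dimension quotas below are illustrative, commonly cited figures for ALBs, not gospel; check the current AWS pricing pages before trusting any of them.

    # A sketch of the LCU math for an ALB. Quota numbers are illustrative.
    def lcus(new_conns_per_sec: float, active_conns_per_min: float,
             gb_processed_per_hour: float, rule_evals_per_sec: float) -> float:
        dimensions = {
            "new connections": new_conns_per_sec / 25,
            "active connections": active_conns_per_min / 3000,
            "processed bytes": gb_processed_per_hour / 1.0,
            "rule evaluations": rule_evals_per_sec / 1000,
        }
        # You're billed each hour on whichever dimension is largest.
        return max(dimensions.values())

    # 100 new connections/sec dominates everything else here: 4 LCUs.
    print(lcus(100, 2000, 0.5, 50))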


Now, NLBs are fascinating because they claim that they don't need to be pre-warmed, which is awesome. ALBs need a little bit, but both need way, way, way less than ELBs did. NLBs are network layer load balancers, but somehow also manage to do TLS or SSL termination, despite the OSI model being very clear that layer seven, the application layer, and layer four, where the NLB lives, are very different things; but the OSI model is a lie that we tell children to confuse them because we thought it was funny.


What the NLB does is dump traffic streams onto various places and in turn let the destination handle them. They now support UDP as opposed to just TCP like they did at launch, so update your mental model. But by and large, if you want to wind up handling everything on the instances themselves and just need something to drop the traffic onto them, the NLB's a decent approach.


ALBs, Application Load Balancers, on the other hand do a bunch of things, but mostly they're used to play slap and tickle with HTTPS requests. They can terminate TLS too, just like NLBs can, because everything's confusing and horrible, but they do a lot more. Specific headers can cause specific routing behaviors. An ALB determines where to route traffic not just based upon the things we've already talked about in the first part of this episode, but also through a whole bunch of different traffic rules.


You can have a whole bunch of different applications living behind a single ALB as a result, and that's often not a terrible direction to go in just from a costing perspective. If you're spinning up a bunch of containerized workloads, you probably don't want to spin up 200 load balancers. Maybe you can do one load balancer and then just give it a bunch of rules that determine which application gets which traffic. It's something to consider in any case.
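

As a sketch of what that looks like with boto3's elbv2 client; every ARN and hostname here is a hypothetical, truncated placeholder:

    # Packing two apps behind one ALB via listener rules.
    import boto3

    elbv2 = boto3.client("elbv2", region_name="us-east-1")

    listener_arn = "arn:aws:elasticloadbalancing:us-east-1:123456789012:listener/..."

    # Host-based rule: api.twitterforpets.com goes to the API target group.
    elbv2.create_rule(
        ListenerArn=listener_arn,
        Priority=10,
        Conditions=[{"Field": "host-header",
                     "Values": ["api.twitterforpets.com"]}],
        Actions=[{"Type": "forward",
                  "TargetGroupArn": "arn:aws:elasticloadbalancing:...:targetgroup/api/..."}],
    )

    # Path-based rule: /images/* goes to a different target group.
    elbv2.create_rule(
        ListenerArn=listener_arn,
        Priority=20,
        Conditions=[{"Field": "path-pattern", "Values": ["/images/*"]}],
        Actions=[{"Type": "forward",
                  "TargetGroupArn": "arn:aws:elasticloadbalancing:...:targetgroup/images/..."}],
    )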


Now, obviously specializing in this stuff goes way deeper than we have time to cover in a single episode, but fundamentally, load balancers are a simple concept that get deceptively deep, deceptively quickly. Welcome to the entire world of networking in the cloud.


That wraps up what I have to say today about load balancers. Join us next week where I make fun of the AWS Global Accelerator and its horrible contributions to climate change.


Thanks again to ThousandEyes for sponsoring this ridiculous podcast. I am cloud economist Corey Quinn based in San Francisco, fixing AWS bills both here and elsewhere, and I'll talk to you next week.


Announcer: This has been a HumblePod production. Stay humble.

Thu, 21 Nov 2019 03:00:00 -0800
A CloudFormation Feature of Great Import
AWS Morning Brief for the week of November 18th, 2019.
Mon, 18 Nov 2019 03:00:00 -0800
Networking in the Cloud Fundamentals, Part 3

About Corey Quinn
Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.


Transcript



This episode of Networking in the Cloud is sponsored by ThousandEyes. Their 2019 Cloud Performance Benchmark Report is now live as of yesterday. Find out which Clouds do what well, AWS, Azure, GCP, Alibaba, and IBM Cloud all have their networking capabilities raced against each other. Oracle was not invited, because we are talking about actual Cloud providers here, not law firms. Get your copy of the report today at Snark.Cloud/realclouds. That's Snark.Cloud/realclouds. That's completely free. Download it, let me know what you think. I'll be cribbing from that in future weeks. Now, for the third week of our AWS Morning Brief Screaming in the Network, or whatever we're calling it, mini-series on how computers talk to one another. Let's talk about the larger internet.


Specifically, we begin with BGP, or Border Gateway Protocol. This matters because it's how different networks talk to one another. If you have a whole bunch of different computer networks gathered into a super network, or internet as some people like to call it, how do those networks know where each one lives? Now, from a home user perspective, or even in some enterprises, that seems like sort of a silly question, because it is. You have a network that lives on your end of things. You plug a single cable in, and every other network lives through that cable. When you're talking about large disparate networks, though, how do they find each other? More to the point, because of how the internet was built, it's designed so that any single network failure can be routed around. There are multiple paths to get to different places: some biased for cost, some biased for performance, some biased for consistency. And all of those decisions have to be made globally. BGP is the lingua franca of how those networks talk to one another. BGP is also a hot mess.


It's the routing protocol that runs the internet, and it's comprised of different networks, in this parlance autonomous systems, or ASes, and it was originally designed for a time before jerks ruled the internet; that's jerks in terms of people causing grief for others, as well as shady corporate interests that are publicly traded on NASDAQ. There's no authentication tied to BGP. Effectively, it is trusted to contain correct data. There is no real signing or authentication proving that someone who announces something through BGP is authorized to do it, and it's sort of amazing the whole thing works in the first place. But what happens is, when a large network with other networks behind it winds up doing an announcement, it says, oh, I have routes to these following networks, and it passes them on to its peers. They in turn pass those announcements on: oh, behind me, two hops this way, is this other series of networks, and so on and so forth.


Now this can cause hilariously bad problems that occasionally make the front page of the newspaper when a bad announcement gets out. A few years ago there was an announcement from an ISP that said, oh, all of YouTube lives behind us. That announcement should never have gone out, and their upstream ISP should have quashed it, and they didn't. So suddenly a good swath of the internet was trying to reach YouTube through a relatively small link. As you can imagine, TCP terminated on the floor. Not every link can handle exabytes of traffic. Who knew? That gets us to another interesting point. How do these large networks communicate with each other? You have this idea of one network talks to another network. Does money change hands? Well, in some cases, no. If traffic volumes are roughly equal and desirable on both sides, we'll have our networks talk to one another, and no money changes hands. This is commonly known as peering.
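

You can see why that kind of hijack works with a few lines of Python: routing follows the longest matching prefix, so a bogus, more-specific announcement beats the legitimate one. The addresses below are from the TEST-NET documentation range, placeholders rather than the actual incident's prefixes:

    import ipaddress

    routes = {
        ipaddress.ip_network("203.0.113.0/24"): "legitimate AS",
        ipaddress.ip_network("203.0.113.128/25"): "hijacking AS",  # more specific
    }

    def best_route(address: str) -> str:
        addr = ipaddress.ip_address(address)
        matches = [net for net in routes if addr in net]
        # Longest-prefix match: the most specific route wins.
        return routes[max(matches, key=lambda net: net.prefixlen)]

    print(best_route("203.0.113.200"))  # -> hijacking AS wins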


At that point, everything is mostly grand, because as traffic continues to climb, you increase the links. Both parties generally wind up paying to operate infrastructure on their own side and in between, and traffic continues to grow. Other times it doesn't work that way where you have one network with a lot of traffic, and another network that doesn't really have much of any, and people want to go from one end to the other. Very often this is known as a transit agreement, and money changes hands from usually the smaller network to the bigger network, but occasionally the other direction depending on the specifics of the business model, and at that point, every byte passing through is metered and generally charged for. Usually this is handled by large ISPs and carriers and businesses behind the scenes, but occasionally it spills out into public view. Comcast and Netflix, for example, have been having a fantastic public spat from time to time, and this manifests itself when there's congestion and you're on Comcast.


If so, I'm sorry for you, and your Netflix stream starts degrading into lower picture quality. Occasionally it skips or whatnot, and strangely, whenever Comcast and Netflix come to an agreement, of course under undisclosed terms, magically these problems go away almost instantly. Originally this sort of thing was frowned upon. The FCC got heavily involved, but with the demise in the United States of network neutrality, suddenly it's okay to start preferring some traffic over others through a legalistic framework, and this has led to a whole bunch of either malfeasant behavior or normal behavior that people believe is malfeasant. And that doesn't leave anyone in a terrifically good place. I'm not here to talk about politics, but it does wind up leading to an interesting place, because there's an existential problem with the business model for an awful lot of ISPs out there. Generally speaking, when you wind up plugging into your upstream provider, maybe it's Comcast, maybe it's AT&T, maybe it doesn't matter, you're generally trying to use them as a dumb pipe to the internet.


The problem is, they don't want to be a dumb pipe. There's a finite number of dollars that everyone is going to pay for access to the internet, and that is a naturally self-limiting business model, so they're trying to add value with services that don't really tend to add much value at all. My wireless carrier, for example, wants to sell me free storage, and an email address, and a bunch of other things that I just don't care about, because I already have an email solution that works out super well for me. The cloud storage that I care about is either Dropbox, something in AWS, or other nonsense. I don't need to have Verizon's cloud storage, but they keep trying to find alternative business models. Some of these are useful and beneficial to everyone, and others are, well, to be honest, less so.


Comcast, for example, isn't going to build you a search engine that is going to rival Google, which is kind of weird on some level because if you take a look from a customer service perspective, Comcast and Google are about on equal footing. But they're not going to be able to deliver the kind of user experience from a localized ISP that a lot of the global providers do. So they're not able to sell value-added services to end users, and they're not able to effectively shake down upstream providers; I mean, can you imagine if you had to pay Comcast extra to access Google, or if magically YouTube was not accessible through one ISP? People would storm their offices. Discussions around peering and transit and trying to shake down upstream providers are sort of how a lot of folks are trying to wring more money out of being dumb pipes, but there is an existential business question for them.



That's more to come on another episode presumably, but now speaking of interesting behavior that varies between different providers as mentioned, this is sponsored by ThousandEyes. Their public cloud performance benchmark is terrific. Is the AWS global accelerator worth the money? Well for that one, tune in next week, or the week after. I'm not sure what the order is, but we will be doing a deep dive into the global accelerator. Do all cloud providers pay the same latency toll when they cross China's great firewall? There are a bunch of different questions that are answered, things that you may not have expected surface in the report, and you can read it now. Go take a look. Sorry Oracle, you are not invited to have your cloud networking performance tested in the report this year, but there's always next year, just grow a little bit and sign a customer or three. Take a look at Snark.Cloud/realclouds. It's absolutely free and it's fascinating. Thanks again to ThousandEyes for their sponsorship of this ridiculous podcast mini-series.


Now, Netflix has been a famous AWS marquee customer for a long time. They spend presumably boatloads of money on AWS, and they wind up having an awful lot of ridiculously impressive conference talks about how they do what they do, and they're very open about their work with AWS.


But what's not necessarily as well known is that when you fire up a Netflix stream, that doesn't stream from AWS, because the bills would be monstrous for data transfer, to start. Instead, they do what any large streaming company does, or any company with significant static assets they need to get to customers: they use a CDN. In Netflix's case, they built their own because of what they do. They call this the Open Connect Project, and details of it are on their website, but what it fundamentally means is that they build boxes that have a whole bunch of hard drive space in them, and they ship them to various ISPs. At times in the United States, Netflix is over a third of internet traffic, so making those providers pay for peering or transit and upgrade equipment as Netflix saturates their links isn't a great plan. "Here's a box with all the popular stuff on it that you can put in your data centers and just stream out to your users" is compelling. That's a win for most folks.


Now, most of us aren't shipping boxes places, but there are CDNs you can use. AWS's CloudFront is an example; Fastly, Akamai, CloudFlare, and a whole bunch of others specialize in different things, but a lot of websites use those. What is a CDN? Well, say you have static assets like CSS, Cascading Style Sheets, or images or video or JavaScript includes that you don't necessarily want customers half a world away grabbing from your web server. You can have a CDN handle a lot of these things. First, they can provide hosting for those static files, and they can cache them at edge points of presence, or POPs, much closer to your customers. The benefit there is that things that require significant page load time, or that suffer significant latency because of bandwidth concerns, wind up way closer to customers, meaning that each request is fulfilled far sooner. Sure, it might only be a hundred milliseconds or so per request, but if you take a look at modern crappy web design, there are 30 to 60 different elements that are often gathered just to load a relatively simple page.


Ad networks make this far, far, far worse. Another part of the value proposition behind Content Delivery Networks, or CDNs, is that they're generally terrific at infrastructure. Very often they'll have their own private links that let them speak back to an origin faster than traversing the public internet; there's a lot more on that, by the way, in the Cloud Performance Benchmark report at Snark.Cloud/realclouds. And they're also, in many cases, able to withstand distributed denial of service attacks. This goes back to the aforementioned jerks on the internet. A DDoS, for those who aren't familiar, is when bad actors wind up throwing a bunch of garbage traffic at various websites in an attempt to take them down. CDNs are generally used to seeing this, and have a bunch of different mitigations in place. Some of them are technical in nature, as far as being able to identify bad traffic and drop it early, whereas others solve the problem rather handily with giant piles of bandwidth. It's somewhat hard to flood an enormous pipe when it can handle more traffic than you can throw at it.


The best part of all is that CDNs tend to be largely single-purpose, and relatively easy to switch between. So if you're looking to have your static assets close to an end user, paying a company that specializes in solving that specific problem, one that has already invested the not insignificant infrastructure costs to build that out, makes an awful lot of sense. There have been a number of different approaches to figuring out which CDN is best, and the easy answer is: the one that works for you. Every CDN tends to have different strengths and weaknesses. For example, AWS's CloudFront is fantastic at a lot of things, but it takes what feels like years to update a distribution. In practice it's only 20 to 30 minutes, but I sometimes lose interest in the middle of writing a tweet. I don't have that kind of attention span.


To sum all of this up, what really is incredible about the internet is how much goes on under the hood just to make very basic, low-lying things work. What's amazing is not in fact that all this complexity is there and that you don't have to think about it, but that it works at all, because there's so much that can cause problems. From a technical perspective, whenever you're dealing with real world infrastructure, it's expensive and takes a long time to fix; but when's the last time, working with the cloud, that you had to think about any of these things?


Of course, if you work at one of the cloud providers, that does not apply to you. Thank you for thinking about these things so those of us building Twitter for pets, obnoxious troll websites don't have to. That sums up the third week of what I'm calling Networking in the Cloud. I am cloud economist Corey Quinn. If you're enjoying this mini-series, please leave it a five star review on iTunes. If you're hating this mini-series, you don't have to listen to it, but please leave a five star review on iTunes anyway, because gamification is how this works. I will be back next week. Thank you for listening to this show, and thanks again to ThousandEyes for their generous sponsorship of my ridiculous nonsense.


Announcer: This has been a HumblePod Production. Stay humble.


Thu, 14 Nov 2019 03:00:00 -0800
EC2 Instances Now On Layaway
AWS Morning Brief for the week of November 11th, 2019.
Mon, 11 Nov 2019 03:00:00 -0800
Networking in the Cloud Fundamentals, Part 2

About Corey Quinn
Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.


Transcript

An ancient haiku reads, "It's not DNS. There's no way it's DNS. It was DNS."


Welcome to the Thursday episode of the AWS Morning Brief. What you can also think of as networking in the cloud. This episode is sponsored by ThousandEyes and their Cloud State Live Event Wednesday, November 13th from 11:00 AM until noon, Central Time. There'll be live streaming from Austin, Texas, the live reveal of their latest cloud performance benchmark where they pit AWS, Azure, GCP, IBM, and Alibaba cloud against each other from a variety of networking perspectives. Oracle Cloud is pointedly not invited. If you'd like to follow along, visit snark.cloud/cloudstatelive, that's snark.cloud/cloudstatelive, and thanks to ThousandEyes for their sponsorship of this ridiculous yet educational podcast episode.


DNS, the domain name system: it's how computers translate names that humans can understand into numbers, because those humans have a first language that is not math. Put more succinctly, if I want to translate www.twitterforpets.com into an IP address of 1.2.3.4, I probably want a computer to do that, because humans find it easier to remember twitterforpets.com. Originally, this was done with a far more manual process: there was a file on every computer on the internet that was kept in sync with every other copy. The internet was a smaller place back then, a friendlier time, when jerks trying to monetize everything at the expense of others didn't yet lurk behind every shadow. So how does this service work?


Well, let's go back to the beginning. When you look at a typical domain name, let's call it www.twitterforpets.com there's a hierarchy built in and it goes from right to left. In fact, if you pick any domain you'd like that ends .com, .net, .technology, .dev, .anything else you care about there's another dot at the end of it. That's right. You could go to www.google.com., and it works just the same way as you would expect it to. That dot represents the root and there are a number of root servers run by various organizations that no one entity controls scattered around the internet and they have an interesting job where their role is to resolve who is the authoritative responsible DNS server for the top-level domains. That's all that the root servers do.


The top-level domains, in turn, have name servers that refer out to whoever is responsible for any given domain within that top-level domain, and so on and so forth. You can have subdomains running at your own company: you could have twitterforpets.com but delegate all of the engineering.twitterforpets.com domains to a separate name server, and so on and so forth. It can hit ludicrous lengths if you'd like. Now, once upon a time, this was relatively straightforward because there were only so many top-level domains that existed: .com, .net, .org, .edu, .mil and so on and so forth. Then the governing body, ICANN, decided, "You know what's great? Money," so they wound up, in turn, going for additional top-level domains that you could grab: .technology, .blog, .underpants for all I know; no one can keep them all in their head anymore. And one incredibly obnoxious purchase leaps to mind: Google's purchase of .dev.


Now, anything you want that ends in .dev can exist as a domain, because Google has taken ownership of that top-level domain. Why is that obnoxious? Well, historically, for the longest time on the internet there were a finite number of top-level domains that people had to worry about. So internally, when people were building out their own environments, they would come up with something that was guaranteed never to resolve; .dev was a popular pick. You could point that to a local name server inside your firewall, or you could even hard-code it on your laptop itself, and it worked out super well. Now, anyone who registers whatever domain you picked has the potential to set up a listener on their end. That is not just a theoretical concern. I worked at a company once that had their domain.com as their external domain and domain.net for their internal domain, which is reasonable, except for the part where they didn't own the .net version of their domain.


Someone else did, and kept refusing offers to buy it. So periodically we would try to log into something internal while not being on the VPN, despite thinking that we were, type a credential into whatever listener was set up on the other end, and immediately have to reset our credentials. It was awful. Try not to do that. If you use a development domain, make sure you own it. It's $12; everyone will be happier this way. Now, a common interview question that people love to ask when it comes to sysadmins, SREs, DevOps, whatever we're calling them this week, is: when I punch www.google.com into my web browser and hit enter, how does it translate that into an IP address?


There are a lot of things you can hit, but by and large, the way that it works is something like this. Oh, and a caveat they love to add in, because otherwise this gets way more complicated, is that every server involved has a cold cache, and we'll get to what that means in a bit. At that point, your browser says, "Oh, who has www.google.com?" It passes that query to the system resolver on your computer, which goes through a series of different resolution techniques. It usually will check the /etc/hosts file if it's on a Mac or a Linux style box, and if there isn't anything hardcoded in there, which there isn't for purposes of this exercise, it queries the system's external resolver.


This is usually provided by your ISP, but you can also use Google's public resolvers 8.8.8.8 and 8.8.4.4, Cloudflare's 1.1.1.1, or OpenDNS's, which are really weird and no one can remember them off the top of their head; there are a lot of different options. When that resolver gets queried, it looks at www.google.com and, because it has a cold cache, its first question is, great, "Who owns .com?" It queries the root name server. The root name server says, "Oh, .com is handled by the .com TLD authoritative servers," and returns who's authoritative for .com to the resolver. The resolver says, "Great," and then queries the authoritative name server for .com: "Who has www.google.com?" That returns the authoritative name servers for google.com.


Now, something strange if you were to actually try this yourself is that the answer to that question is generally ns1.google.com. That sets up the opportunity for an infinite loop: to find ns1.google.com, you'd have to ask .com, "Who has ns1.google.com?" Except that when .com returns that result, it specifically includes an IP address. That IP address is known as a glue record, and it breaks the circular dependency. Glue records are often one of those things that pop up in sysadmin-type interviews to prove the interviewer thinks they're smarter than you are. From there, the resolver queries ns1.google.com, "Who has www.google.com?" and the ns1.google.com authoritative server, in turn, responds with an IP address. The resolver caches that result while passing it back to the original requester, and the next time that resolver is queried, it has that in cache until the TTL expires.
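

If you want to watch one hop of that dance yourself, here's a sketch using dnspython, which I'm assuming you have installed (pip install dnspython). Real resolvers follow referrals, use glue, and retry; this just asks a root server the first question in the chain:

    import dns.message
    import dns.query

    query = dns.message.make_query("www.google.com.", "A")
    # 198.41.0.4 is a.root-servers.net.
    response = dns.query.udp(query, "198.41.0.4", timeout=3)

    # The root doesn't answer directly; it hands back a referral: the NS
    # records for com. in the authority section, plus glue in additional.
    for rrset in response.authority:
        print(rrset)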


What is the TTL? It stands for time to live, because a lot of these things don't change very often, but they do change from time to time. For example, if I want to re-point my website from one provider to another, I don't want everyone to continue going to the old provider in perpetuity, but I also don't necessarily want to slow everyone down by having them query, "Who has responsibility for my site?" and go through that whole DNS chain every single time. Setting reasonable time to live values is a bit of an art when it comes to DNS; some forms of load balancing use incredibly low values in case things change on a minute-to-minute basis. But by and large, what happens is that when anything along that path queries and gets a result, it comes with a time to live field, and when that gets exceeded, the result is considered stale and is discarded, and the query has to go out again.


Let's talk a little bit about how that works in a time of cloud. However, first, let's talk a little bit more about Cloud State Live. Who's going to be there? Well, cloud, internet, and network experts from both digital native and digitally transforming companies are going to be there. Media and industry analysts will be there, which is how I managed to sneak in last year, because it turns out when you call yourself an analyst, there's no actual certifying body that proves it, and I had everyone fooled. Anyone and everyone with a vested interest in the cloud is welcome to join Cloud State Live, either in person in Austin, Texas or on the live stream, where I will attempt to aggressively live Tweet it in my typical style. Last year's was in San Francisco, so I was able to sneak in without hopping a plane. This year I'll be doing it remotely, so you're definitely going to want to follow along here.


They're also teasing the reveal of a major innovation that will finally change the power dynamic for every business that relies on the internet, which, of course, is every business. Find out more and see if they live up to that at snark.cloud/cloudstatelive. That's snark.cloud/cloudstatelive, and my thanks to ThousandEyes for sponsoring this ridiculous yet strangely informative podcast. Now, in a world of cloud, there are two different kinds of DNS servers, the same as there are on the internet: the authoritative servers that own the records for a zone, and the resolvers that wind up going out and figuring out what everything else on the internet has. Route 53 is AWS's authoritative DNS service, and unlike any other public service that AWS has, it offers a 100% SLA, meaning it will always be available; it'll always be up.


And some folks don't believe that, and you shouldn't; I take SLA guarantees with a grain of salt, except for the fact that it's DNS. If they're publishing a 100% SLA, then services internal to AWS are inherently going to be building to that 100% SLA. So should Route 53 go down, and at some point it almost certainly will, because it's a computer and computers break, it's what they do, then internally we'll almost certainly see significant outages of other AWS services with baked-in dependencies on Route 53. So if you're looking to get out of any potential DNS issues by just having a secondary provider available, you may have to do some more work than just DNS.


Now, cloud providers are fascinating in this world, just because they have built systems that are fully compatible with DNS, because it is a worldwide, well-known protocol and you can't just build your own without some serious buy-in from other folks; but they also wind up doing it in their own special way. A common question that you'll get in these interviews, again, is, "Does DNS speak UDP or TCP?" And the easy answer is, "Oh, it speaks UDP," which it does, but there are exceptions, and those exceptions are what that condescending interviewer at Google almost certainly wants to hear.


UDP, in a DNS context, was originally limited to 512 bytes. That's why there are only 13 root name servers; anything more wouldn't fit in a single DNS packet. Now, if the result is larger than 512 bytes, what happens traditionally is that the UDP packet fits as much data as it can and then sets the truncated bit in the packet, meaning that it is left up to the client to decide, "Do I just make do with these partial results, or do I retry using TCP?" You can't guarantee that any particular client is going to do exactly what you expect, so you have to account for that. So the correct answer to "Does DNS speak UDP or TCP?" is both. But there's one other edge case as well.
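

Here's a sketch of handling that truncation correctly with dnspython (again, assumed installed): check the TC flag, and if it's set, repeat the same query over TCP.

    import dns.flags
    import dns.message
    import dns.query

    query = dns.message.make_query("example.com.", "TXT")
    response = dns.query.udp(query, "8.8.8.8", timeout=3)

    # If the truncated bit is set, the polite client retries over TCP.
    if response.flags & dns.flags.TC:
        response = dns.query.tcp(query, "8.8.8.8", timeout=3)

    print(response.answer)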


All of the records for a zone live in what is known as a zone file. Zone files effectively serve as the authoritative record of what lives inside a given zone right now, and when one updates on a primary DNS server, a secondary DNS server pulls it to figure out what has changed. There are two ways that zone transfers happen in legacy systems, by which I mean not cloud systems, by which I mean computers you control: the AXFR, which is a complete transfer of the entire zone file, and the IXFR, which is a partial transfer of just what's changed. Now, Route 53 supports neither of those. That works fine for some use cases and causes problems for others, but it does wind up being a difference that people sometimes forget about, and if you're trying to pair Route 53 with some other form of DNS server, you have some work to do.


Now, lastly, before we sign off, I want to talk about a few stupid DNS tricks that I love. My personal favorite story is that Route 53 is my favorite database, because again, it has that 100% SLA, you can query it, and DNS is fundamentally a large key-value store. Some would say a key-value store isn't a database, but Redis calls itself one, so who are we to complain? Now, originally I would use TXT records to give me further information about various systems I ran: what rack they lived in, et cetera. So you could make a TXT record query for any given resource and get a pile of information back. There are much better ways to do this these days, but it mostly worked.
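

For the morbidly curious, here's roughly what that gag looks like as a boto3 sketch; the hosted zone ID and record name are hypothetical placeholders.

    # The Route-53-as-database trick: stash inventory data in a TXT record.
    import boto3

    r53 = boto3.client("route53")

    r53.change_resource_record_sets(
        HostedZoneId="Z123EXAMPLE",  # hypothetical zone
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "web01.inventory.twitterforpets.com.",
                "Type": "TXT",
                "TTL": 300,
                # TXT record values must be wrapped in literal double quotes.
                "ResourceRecords": [{"Value": '"rack=12 row=3 role=web"'}],
            },
        }]},
    )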


That said, it was a database, and I maintain Route 53 remains my favorite database. You could take it a step further, although Route 53 doesn't support this yet, and use DNS itself as a transport layer for something else riding on top of it. Iodine is a good example of this: you can put TCP streams over DNS as a transport, so you can have a VPN that winds up going over DNS, or OpenVPN can be hacked to work over DNS as well.


Why would you do this? Well, an awful lot of terrible captive portals in various coffee shops and whatnot won't let you connect to the internet without paying them or handing over a bunch of personal information, but they will resolve DNS externally. So you can sort of pivot over the top of any barriers by using DNS. Now, beyond letting you dodge portals that scam money out of people, this can be a serious security concern, because you can start using DNS to exfiltrate data from inside of an environment. It's always DNS, even when it's not. That more or less rounds out what I had to say about DNS this week. If you disagree with anything I have to say, first let me condescendingly tell you you're wrong, even though you're probably not. Secondly, feel free to chime into the conversation on Twitter. I'm Quinny Pig, that's Q-U-I-N-N-Y Pig, or visit me on the web at www.lastweekinaws.com. I'm cloud economist Corey Quinn, and I'll talk to you next week about more network things.


Announcer: This has been a HumblePod Production. Stay humble.

Thu, 07 Nov 2019 03:00:00 -0800
The Rain in Spain Falls Mainly on the Control Plane
AWS Morning Brief for the week of November 4th, 2019.
Mon, 04 Nov 2019 03:00:00 -0800
Networking in the Cloud Fundamentals, Part 1

Links Referenced


Transcript


UDP. I'd make a joke about it, but I'm not sure you'd get it.


This episode is sponsored by ThousandEyes. Think of ThousandEyes as the Google Maps of the internet. Just like you wouldn't dare leave San Jose to drive to San Francisco without checking whether 101 or 280 was faster (and yes, that's a very localized San Francisco Bay Area reference), businesses rely on ThousandEyes to see the end-to-end paths their apps and services are taking from their servers to their end users, to identify where the slowdowns are, where the pileups are hiding, and what's causing the issues. They use ThousandEyes to see what's breaking where, and importantly, they share that data directly with the offending service providers to hold them accountable and get them to fix the issue fast, ideally before it impacts end users. You'll be hearing a fair bit more about ThousandEyes over the next 12 weeks, because Thursdays are now devoted to networking in the cloud. It's like Screaming in the Cloud, only far angrier.


We begin today with the first of 12 episodes. Episode one: the fundamentals of cloud networking. You can consider this the AWS Morning Brief, networking edition. A common perception in the world of cloud today is that networking doesn't matter, and that perception is largely accurate: you don't have to be a network engineer the way any reasonable systems or operations person did even 10 years ago, because in the cloud the network doesn't matter at all, until suddenly it does, at the worst possible time, and then everyone's left scratching their heads.


So let's begin with how networking works, because a computer in 2019 is pretty useless if it can't talk to other computers somehow, and for better or worse, Bluetooth isn't really enough to get the job done. Computers talk to one another over networks, basically by having a unique identifier. Generally, we call those IP addresses, here in the path this future has taken. In a different world we would've gone with token ring and a whole bunch of other protocols, but we didn't. Instead we went with IP, the unimaginatively named Internet Protocol, specifically the current version, version four. We're not talking about IPv6, because let's not kid ourselves: no one's really using that at scale, despite everyone claiming it's going to happen real soon now.


So there are roughly 4 billion IP addresses and change, and those are allocated throughout effectively the entire internet. When this stuff was built back when it was just defense institutions and universities on the internet, 4 billion seemed like stupendous overkill. Now it turns out that some people have 4 billion objects on their person that are talking to the internet and all chirping and distracting them at the same time when you're attempting to have a conversation with them.


So those networks are broken down into subnetworks, or subnets, for lack of a better term. They can range anywhere from a single IP address, which in CIDR (C-I-D-R) parlance is a /32, to all 4 billion and change, which is a /0. A common one is the /24, which is 256 IP addresses, of which 254 are usable; you can expand that to 512 with a /23, and so on and so forth. The specific math isn't particularly interesting or important, and it's super hard to describe without some kind of whiteboard, so smile, nod, and move past it. So then you have all these different subnets: how do they talk to one another? The easy way to think of it is, "Oh, I have one network, I plug it directly into another network, and they can talk to each other."
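If you do want the whiteboard math, Python's standard ipaddress module will do it for you; a quick sketch:

```python
import ipaddress

net = ipaddress.ip_network("10.0.1.0/24")
print(net.num_addresses)         # 256
print(len(list(net.hosts())))    # 254 usable; network and broadcast excluded

print(ipaddress.ip_network("10.0.0.0/23").num_addresses)   # 512

# The whole IPv4 internet: a /0.
print(ipaddress.ip_network("0.0.0.0/0").num_addresses)     # 4294967296
```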


Well, sure, in theory. In practice it never works that way, because those two networks are often not adjacent. They have to talk to something else, going through different hops from here to there to somewhere else, and somewhere else again, before finally reaching the destination they care about. And when you look at the internet as a network that spans the entire world, that turns into a super complicated problem. Remember, the internet was originally designed to withstand massive disruption, generally in terms of nuclear war, where large percentages of the earth were no longer habitable; it had to be able to route around damage, and routing is more or less how that wound up working.


The idea that you could have different paths to the same destination solves an awful lot. It's why the internet is as durable as it is, but it also explains why these things are terrible and why everyone is so quick to blame the network. One last thing to consider is network address translation. There are private IP address ranges that are not reachable over the general internet: anything starting with a 10, for example (the entire 10/8 is private IP address space), same with anything in 192.168, and anything from 172.16 through 172.31, which is the 172.16/12 block. Translating those private IP addresses into public IP addresses is known as network address translation, or NAT. We're not going to get into the specifics of that at the moment; just know that it exists.
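If you're ever unsure whether an address is private, the standard library will tell you; a quick sketch with Python's ipaddress module:

```python
import ipaddress

# The RFC 1918 private ranges: 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16.
for addr in ["10.1.2.3", "192.168.0.10", "172.16.0.1",
             "172.31.255.255", "172.32.0.1", "8.8.8.8"]:
    print(addr, ipaddress.ip_address(addr).is_private)
# The first four print True; 172.32.0.1 and 8.8.8.8 print False,
# since the 172 private block ends at 172.31.255.255.
```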


Now, most traditional networking experience doesn't come from working in the cloud; it comes from working in data centers, a job that sucks. Some of the things you learn doing that are tremendously impactful: they completely change how you view how computers work, and in the cloud that knowledge becomes invaluable. So let's talk a little bit about what it looks like in the world of cloud, specifically AWS, because AWS had effectively five years of uninterrupted non-compete time when no one else was really playing with cloud. By the time everyone else woke up, the patterns AWS had established were more or less what other people were using. This is the legacy of Rip Van Winkling through five years of cloud. If you'd rather I talk about a different company instead, that other company should have tried harder.


In an AWS context, they have something known as a virtual private cloud, or VPC, and planning out what your network looks like in those environments is relatively challenging, because people tend to make some of the same mistakes here that they did in data centers. For example, something that has changed: common wisdom in a data center was that anything larger than a /23, a subnet with 512 IP addresses, was a complete non-starter, because at that point the subnet's broadcast domain (everything being able to talk to everything) is large enough to completely screw over your switch. It would get overwhelmed; you'd wind up with massive challenges and things falling over constantly, so having small subnets was critical.


Now, in the world of cloud, that's not true anymore, because broadcast storms aren't something AWS and other reasonable cloud providers allow to happen. They get tamped down; there are rate limits; they do all kinds of interesting things that mean this isn't really an issue. So if you want to have a massive flat network inside of a VPC, knock yourself out; you're not going to break anything, whereas if you do this in a data center, you absolutely will. So that's one of those things that needs to be adjusted as you move from legacy on-premises environments into the world of cloud.


Another common network failure mode that hasn't changed: packing subnets right next to each other. If you have a bunch of /24s, say a 10.0.1.0/24 (anything ranging from 10.0.1.0 to 10.0.1.255), people would naturally want to put the next subnet right beside it, 10.0.2.0/24, and so on and so forth. The problem is that by packing them right next to each other, when one thing explodes, such as having a whole bunch more computers there than you thought, or hey, there's this container thing now where a whole bunch of IP addresses are tied to one computer, suddenly the entire pattern changes and you can't expand the subnet to make it bigger, because you'll run into the next subnet up. Then you have to do data center moves, and this was freaking horrible.


Everyone hated it, no one liked it, and nothing good came of it. That's still a problem now: you want to make sure that you can expand your subnets significantly without stomping into other ranges, so planning your network addresses inside of a VPC is still a thing. It's the sort of mistake you avoid without even stopping for breath the second time, because the first time it happens to you, it leaves scars, and you remember it.
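Here's the trap in miniature, using Python's standard ipaddress module:

```python
import ipaddress

# Four /24s packed back to back, data-center style.
subnets = [ipaddress.ip_network(f"10.0.{i}.0/24") for i in range(4)]

# Try to grow 10.0.1.0/24 into a /23. There's no such network as
# 10.0.1.0/23; the only valid /23 containing it starts at 10.0.0.0.
grown = subnets[1].supernet(new_prefix=23)
print(grown)  # 10.0.0.0/23

# And that lands squarely on a neighbor.
print([str(s) for s in subnets if grown.overlaps(s)])
# ['10.0.0.0/24', '10.0.1.0/24']
```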


Something else that I love is that bad cables aren't a thing anymore in the world of cloud. How do you handle a cable? How do you crimp it appropriately? You used to be able to interview data center people by asking them to recite from memory the pinout of the T568B Ethernet standard, because there are eight wires inside a Cat 5 cable and crimping them in the right order is incredibly important. Now the answer is: I don't care, I don't have to care, I don't need to know these things. But you still remember them, like when you lose an entire day to a bad cable, and maybe the whole batch was bad, so you wind up with weird intermittent failures. The lesson you take from this: when you throw away the bad cable, you cut it first with nature's best firewall, a wire cutter, because otherwise some well-meaning idiot (and yes, I've been that idiot) will take it out of the trash: "Oh, this cable looks fine. I'll put it back in the spare parts bin."


Then you're testing one bad cable with another bad cable, and well, it couldn't be the cable in that case, and people go slowly mad. The pro move is to have a network cable tester in the data center when you're building things out. Getting away from the hardware a bit, because that's the whole point of cloud, where we don't have to think about it anymore: certain assumptions get baked into anything you build on premises. You have to worry about things like cables failing; you don't generally think about that in the cloud. You have to worry about bottlenecking at the top-of-rack switches, where you have a whole bunch of systems talking to each other at one gigabit per second apiece and a 10 gigabit link between racks.


Well, okay: more than 10 servers talking to one another are not going to fit over that link, so you have to worry about bandwidth constraints. You have to worry about cables failing, and how are we going to fix that? How are we going to route over a secondary set of cables? In the cloud, you generally don't have to think about most, if any, of that. Even things like drive failures wind up not being an issue, let alone cabling issues; it all slips below the surface of awareness. Same story with routing inside of AWS: route tables are relatively simplistic things in the world of cloud compared to any routing situation you'd find in a data center. The reason is that you're not having 15 AWS accounts all routing through each other to get from one end of your network to the other.


If you are, for God's sake, stop it and do literally anything else, because what you're doing is awful. This seems like a good point to pause and talk a little more about ThousandEyes, which is not abjectly awful. They tend to focus on a couple of different things. The first is the consumer digital experience; think SaaS providers. They care about providing visibility into global networks, because consumers don't wait for anything anymore. I know I'm impatient, and most people I know are too; if they're not, I've probably gotten impatient and stopped waiting around for them. If Netflix is slow, people move to Hulu. If Uber isn't loading, we'll take a Lyft. If Twitter's down, they're on to Facebook, or somewhere profoundly more racist if they can find it. So businesses that simply wouldn't exist without the internet absolutely rely on ThousandEyes for visibility into effectively every potential point of failure along the service delivery chain.


So when things break, because things always break (welcome to computers), they aren't wasting precious time in war rooms trying to figure out whose fault it is. The second type of customer that tends to bring ThousandEyes in is on the employee digital experience side of the house. We're all on Office 365, Salesforce, WebEx, Zoom, or other things that don't work: Zendesk, JIRA, GitHub, et cetera, et cetera. Because if employees can't get their jobs done, you're paying an awful lot of expensive people to sit around twiddling their thumbs and complaining about not being able to work. So internal IT teams who manage massive SaaS deployments use ThousandEyes for visibility into what's breaking where. We're going to hear a lot more about ThousandEyes over the next 12 weeks or so, but check them out: visit them at thousandeyes.com. That's thousandeyes.com, and my thanks to them for supporting this ridiculous run of my ridiculous podcast.


Let's talk about Network ACLs, or NACLs. I don't care how you pronounce it; I'm sure one of us is wrong, and I just don't care anymore. They're a terrible idea, and the reason they're a terrible idea, whether in the cloud or in an on-premises environment, is that people forget they're there. Amazon's guidance is to leave the default NACL alone and never touch it again, and the problem is that NACLs, plus routing, plus subnets, plus security groups on top of all that, replicate data center tiers of complexity. We really don't need that anymore. If you take a look at any on-premises environment that's been lifted and shifted into the cloud, you'll notice a tremendously complicated AWS network that doesn't need to be that complicated. They're just replicating what they had in their rusted-iron data centers in the world of cloud.


You don't need that. In the cloud, internal networks can be largely flat, because security is no longer defined by what IP address something has, but by the roles things assume. You can move up the stack and get better access control without depending on network borders to keep your stuff safe. Something else to consider: private versus public subnets aren't really a thing in the on-prem world; there, a subnet either does or doesn't route to the internet. In the cloud, they're absolutely different: things in a private subnet have no public IPs, things in public subnets do have public IP addresses, and that's how you accidentally expose your database to the entire internet. If you do have things in a private subnet that need to talk to the internet, that's where NAT comes in.
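As a hedged sketch of what role-based access looks like with boto3 (both security group IDs are hypothetical placeholders): let the database tier accept traffic from anything wearing the web tier's security group, rather than from an IP range:

```python
import boto3

ec2 = boto3.client("ec2")

# Allow Postgres from the web tier's security group, not from a CIDR.
# Both group IDs are hypothetical placeholders.
ec2.authorize_security_group_ingress(
    GroupId="sg-0aaaa1111example",  # database tier
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 5432,
        "ToPort": 5432,
        "UserIdGroupPairs": [{"GroupId": "sg-0bbbb2222example"}],  # web tier
    }],
)
```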


We mentioned that a little earlier in this episode. You used to have to run NAT instances yourself, which was annoying, and then AWS came out with the Managed NAT Gateway, which is of the freaking devil, because it charges a $0.045 per gigabyte data processing fee on everything passing through it. That's not data transfer; that is data processing, because once again, it's of the devil. If you have this at any significant point of scale, for God's sake, stop. I'll be ranting about this in a later episode of this mini-series, so we're going to put a pin in it for now, before I blow out the microphone and possibly some speakers. Lastly, something else that tends to get people in trouble is heartbeat protocols, where you have two routers in an on-premises environment, one active, one on standby. If you look at failure analysis, the most common cause of failure is not a router failing.


It's the thing that keeps them in check, where they talk to one another to ensure both are still working; that thing fails. So then you have two routers vying for control, no longer talking to one another, and it brings your entire site down. That's not a great plan. Consider not doing that in the world of cloud, if you can avoid it.


So, in conclusion, the network does matter, but if you do it right, it doesn't matter as much as it once did. That said, over the next 11 weeks we're going to talk through exactly why and how it matters. I'm Cloud Economist Corey Quinn. Thank you for joining me, and I'll talk to you about the network some more next week.

Thu, 31 Oct 2019 03:00:00 -0700
Last of the JEDI
AWS Morning Brief for the week of October 28th, 2019.
Mon, 28 Oct 2019 03:00:00 -0700
AWS CloudWatch Anomaly Wake-Up Calls
AWS Morning Brief for the week of October 21st, 2019.
Mon, 21 Oct 2019 03:00:00 -0700
The Right to Bare Metal ARMs
AWS Morning Brief for the week of October 14th, 2019.
Mon, 14 Oct 2019 03:00:00 -0700
Hope and Change Management
AWS Morning Brief for the week of October 7th, 2019.
Mon, 07 Oct 2019 03:00:00 -0700
API Has Two Syllables
AWS Morning Brief for the week of September 30th, 2019.
Mon, 30 Sep 2019 03:00:00 -0700
NoSQL Workbench Gets Rapid Sequel
AWS Morning Brief for the week of September 23rd, 2019.
Mon, 23 Sep 2019 03:00:00 -0700
CSI: Driver Support YEEEEEAAAAAAAAHHHHHH!
AWS Morning Brief for the week of September 16th, 2019.
Mon, 16 Sep 2019 03:00:00 -0700
Amazon SageMaker Private Worker Throughput Worker

AWS Morning Brief for the week of September 9th, 2019.

Mon, 09 Sep 2019 03:00:00 -0700
me-south-1 is Southern Maine, right?
AWS Morning Brief for the week of September 2nd, 2019.
Mon, 02 Sep 2019 03:00:00 -0700
Amazon Fivecast
AWS Morning Brief for the week of August 26th, 2019.
Mon, 26 Aug 2019 03:00:00 -0700
The Seven Things You Can’t Say at re:Invent
AWS Morning Brief for the week of August 19th, 2019.
Mon, 19 Aug 2019 03:00:00 -0700
SageMaker Supports R, Pirates' First Love Remains C
AWS Morning Brief for the week of August 12th, 2019.
Mon, 12 Aug 2019 03:00:00 -0700
CapitalOne's CapitalZeroDay
AWS Morning Brief for the week of August 5th, 2019.
Mon, 05 Aug 2019 03:00:00 -0700
Spot Instances for IBM Enterprise Linux
AWS Morning Brief for the week of July 29th, 2019.
Mon, 29 Jul 2019 03:00:00 -0700
Elastic Fabric Adapters and the Suspenders of Disbelief
AWS Morning Brief for the week of July 22, 2019.
Mon, 22 Jul 2019 03:00:00 -0700
Marching in CloudFormation: Their Rebuilding Year
AWS Morning Brief for the week of July 15, 2019.
Mon, 15 Jul 2019 03:00:00 -0700
If You Can't Make It In New York, AWS Will
AWS Morning Brief for the week of July 8th, 2019
Mon, 08 Jul 2019 03:00:00 -0700
reInforce Meant Learning
AWS Morning Brief for the week of July 1st, 2019.
Mon, 01 Jul 2019 03:00:00 -0700
AWSECS4K8S(EKS) Finally Renamed
AWS Morning Brief for the week of June 24th, 2019.
Mon, 24 Jun 2019 03:00:00 -0700
The AWS Backwards Shuffle
AWS Morning Brief for the week of June 17th, 2019.
Mon, 17 Jun 2019 03:00:00 -0700
Tom Clancy’s Systems Manager OpsCenter
AWS Morning Brief for the week of June 10th, 2019.
Mon, 10 Jun 2019 03:00:00 -0700
Data Ah-Pee Goes GA
AWS Morning Brief for the week of June 3rd, 2019.
Mon, 03 Jun 2019 03:00:00 -0700
Welcome to AWS Morning Brief
Welcome to AWS Morning Brief, the podcast that summarizes the news from the AWS ecosystem--and makes fun of it.
Fri, 31 May 2019 03:00:00 -0700