AWS S3 Is 'Pushing To Become Primary Storage For A Lot Of Applications'

At the recent AWS re:Invent conference, the hype was all around AI, but the big launch for many users was S3 (Simple Storage Service) Express One Zone – an S3 tier offering much lower latency than standard S3 buckets. This means S3 can be used directly by a wider range of applications, altering the storage cost and performance calculations.

Calling S3 Express a new tier is hardly adequate. It is a new kind of storage that introduces a second bucket type: the Directory Bucket. "Directory buckets organize data hierarchically into directories as opposed to the flat storage structure of general purpose buckets. There aren't prefix limits for directory buckets, and individual directories can scale horizontally," the docs explain.

Another difference between S3 Express and standard S3 is that Express is limited to a single Availability Zone – normally one local to the compute instances that will be used with it.

There are a number of S3 features that are not supported by S3 Express – including S3 replication, S3 versioning, static website hosting, PrivateLink for S3, server-side encryption with AWS Key Management Service or customer-provided keys, S3 Event notifications, and more. S3 Express uses the S3 API, and offers great performance, but does not implement the full range of S3 features.

Andy Warfield, AWS distinguished engineer Amazon S3

At re:Invent in Las Vegas, Andy Warfield, AWS distinguished engineer for Amazon S3, described the changing nature of S3 – the first AWS service.

"When S3 launched 17 years ago, there was talk at the time about it being like the storage system for the internet. But in reality, it was broadly treated as an archival store that was secure and durable, and people built websites on top," explained Warfield. "A remarkable thing over the past five and ten years is the uptick in analytics workloads directly over S3, the replacement of enterprise HDFS [Hadoop Distributed File System] clusters with MapReduce and Spark jobs," he told The Register, and this "really changed the shape of the service."

"S3 moved from being storage and durability-centric to also having a throughput focus. Over the past five years the same customers that pulled us to our throughput focus are now pulling us to a latency focus. From a developer perspective it's around integrating S3 more into the application."

The old S3 pattern is that developers pull data out of S3, write it to an intermediate store for interaction, and then perhaps write it back to S3 for durability. "What we're hearing is customers just want to use the APIs and access that data," observed Warfield. "S3 is pushing to become primary storage for a lot of applications."

A discussion on Hacker News offers some background from former AWS engineers. "This is the low latency S3 that is written in Rust. Finally launched after years in the making," wrote one. Another who "used to work on S3" argued that standard S3 has "slow first byte latency" because of use of hard drives rather than SSDs, Java with garbage collection on the request path, and that "to reduce storage costs, objects are erasure-coded 'wide', which means many hosts are involved in servicing a request. This indicates only one such sub-request has to be slow to slow your request down. The new storage class is SSD-backed, presumably doesn't use Java anywhere, and doesn't stripe your data across as many hosts."
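
The point about wide erasure coding can be made concrete with a little arithmetic. The sketch below is a simplification, not a description of how S3 works internally: it assumes a request must wait on every sub-request, that sub-requests are independent, and that each one is slow with the same probability. The function name and the 1 percent figure are illustrative assumptions.

```python
def prob_request_slow(p_slow: float, fanout: int) -> float:
    """Chance that at least one of `fanout` independent sub-requests is slow.

    Assumes the request waits on all sub-requests (a simplification).
    """
    return 1.0 - (1.0 - p_slow) ** fanout


# Illustrative only: each sub-request is slow 1 percent of the time.
for fanout in (1, 4, 16, 64):
    print(f"fanout={fanout:3d}  P(slow request)={prob_request_slow(0.01, fanout):.3f}")
```

Under these assumptions the chance of a slow request rises steadily with the number of hosts involved, which is why striping across fewer hosts helps first-byte latency.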

If S3 Express is a reimplementation of the S3 back-end in Rust, that would explain both its performance boost and why it is different from, and less comprehensive than, standard S3.

While Warfield could not confirm the details, he did reveal that the teams have "reinvented a lot of what we'd done there and done a ton of work in Rust … we published a paper about rewriting ShardStore, which is our on-disk layout, completely in Rust. We brought in a whole bunch of formal verification tools at that level to get correctness, because it's pretty hair-raising code to be replacing, so that trend and the skills we've built around Rust we've carried into our work on Express."

Warfield also explained that to get the best out of S3 and Express, a lot of work had been done in the client tools. "We launched this thing called the CRT, the Common Runtime, which is a native code library that embeds all of our best practices for performance, so it automatically parallelizes transfers. Its goal is to saturate the NIC. On top of the CRT we've delivered integrations into S3 for Hadoop, we launched a PyTorch plugin, we launched MountPoint, which is a FUSE connector that gives you a file connection into S3. All of those things now support Express at launch."
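
To illustrate what "automatically parallelizes transfers" means in general terms, here is a minimal sketch of splitting one large object into byte ranges and fetching them concurrently. It is not the CRT and does not call any AWS API: `split_ranges`, `parallel_fetch`, and the in-memory `fake_fetch` stand-in are hypothetical names invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(size, part_size):
    """Return inclusive (start, end) byte ranges that cover `size` bytes."""
    return [
        (start, min(start + part_size, size) - 1)
        for start in range(0, size, part_size)
    ]


def parallel_fetch(fetch_range, size, part_size, workers=8):
    """Fetch every range concurrently and reassemble the parts in order."""
    ranges = split_ranges(size, part_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the order of the input ranges.
        parts = pool.map(lambda r: fetch_range(*r), ranges)
        return b"".join(parts)


# Stand-in for a ranged GET against object storage (hypothetical).
blob = bytes(range(256)) * 100


def fake_fetch(start, end):
    return blob[start:end + 1]


result = parallel_fetch(fake_fetch, len(blob), part_size=4096)
print(result == blob)
```

A real client would also have to choose part sizes, retry failed ranges, and bound memory use, which is the kind of tuning Warfield says the CRT packages up.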

S3 Express is more expensive than standard S3, though it can be used as an alternative to some of the other storage options that AWS offers. Cost is important as well as performance, so what guidance is there for customers who want to optimize their value?

The key thing, Warfield explained, is that with S3 Express the request pricing is around 50 percent lower for a lot of workloads, but the storage capacity costs more. "So we expect workloads to be relatively short-lived, and the up to 50 percent saving that we see on the request side are translating into up to 80 percent savings on end-to-end workload costs because jobs finish faster, you have less compute, fewer GPU hours."
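
The trade-off Warfield describes can be sketched as a simple cost model: storage is dearer, requests are cheaper, and a faster job needs fewer compute hours. Every number below is a made-up placeholder chosen for illustration, not an AWS price or a measured speed-up, and the `Tier` and `job_cost` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    storage_per_gb_month: float   # placeholder price, not an AWS figure
    cost_per_1k_requests: float   # placeholder price, not an AWS figure


def job_cost(tier, gb_stored, days_stored, requests, compute_hours, compute_per_hour):
    """Total cost of one short-lived job: storage + requests + compute."""
    storage = tier.storage_per_gb_month * gb_stored * days_stored / 30
    request_fees = tier.cost_per_1k_requests * requests / 1000
    compute = compute_hours * compute_per_hour
    return storage + request_fees + compute


# Hypothetical tiers: the low-latency tier charges more for capacity and
# half as much per request.
standard = Tier(storage_per_gb_month=0.02, cost_per_1k_requests=0.0004)
low_latency = Tier(storage_per_gb_month=0.16, cost_per_1k_requests=0.0002)

# Hypothetical job: 1,000 GB held for one day, 50 million requests, and an
# assumed drop from 10 to 6 compute hours because the job finishes sooner.
cost_standard = job_cost(standard, 1000, 1, 50_000_000, 10, 30.0)
cost_low_latency = job_cost(low_latency, 1000, 1, 50_000_000, 6, 30.0)
print(f"standard: {cost_standard:.2f}  low-latency: {cost_low_latency:.2f}")
```

With these placeholder inputs the compute term dominates, so the higher storage rate matters little for a job that holds its data for only a day. Holding the same data for months would shift the balance back toward the cheaper storage tier.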

The pattern then is that developers might transfer data into S3 Express for a job, perform that job, and then move it out. Warfield points us towards a "less glamorous" announcement at re:Invent concerning S3 Batch Operations. These now allow users to specify "an entire bucket, prefix, suffix, creation date, or storage class" in order to move S3 objects in a single step, making it easier to do this kind of data transfer. He also hinted that we will see more such convenience features in future.

What's next for S3 Express? Warfield explained that "a lot of the customer asks really break into two categories. One is the sharp edges that are different between this single zone bucket and the rest of S3, so it's polish, and there's a lot of work to do. The other one is around those like more interactive storage requests. You'll see us focus on this much more interactive interface to S3 data, filling out the things that you’ll want to do with that kind of interface." ®