We've reduced the prices for Amazon S3 storage again. As is always the case, the cost to store your existing data will go down. This is markedly different from buying a hard drive at a fixed cost per byte, and it is just one of the many advantages of using cloud-based storage. You can also count on Amazon S3 to deliver essentially infinite scalability, eleven nines of durability (99.999999999%), and your choice of four distinct geographic locations for data storage.
So, starting November 1, 2010, you'll see a reduction of up to 19% in your overall monthly storage charges. We've created a new pricing tier at the 1 TB level and removed the existing 50-100 TB tier, thereby extending our volume discounts to more Amazon S3 customers.
The new prices for standard storage in the US Standard, EU (Ireland), and APAC (Singapore) regions are as follows:
| Tier | Current price per GB | New price per GB |
|------|----------------------|------------------|
| First 1 TB | $0.150 | $0.140 |
| Next 49 TB | $0.150 | $0.125 |
| Next 50 TB | $0.140 | $0.110 |
| Next 400 TB | $0.130 | $0.110 |
| Next 500 TB | $0.105 | $0.095 |
| Next 4000 TB | $0.080 | $0.080 (no change) |
| Over 5000 TB | $0.055 | $0.055 (no change) |
Reduced Redundancy Storage will continue to be priced one-third lower than standard storage in all regions.
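To make the tier arithmetic concrete, here is a quick sketch of how a monthly bill adds up under the new rates. This is purely illustrative; the function and tier encoding are our own, not an official AWS calculator.

```python
# A minimal sketch (our own illustration, not an official AWS
# calculator) of how the new tiered rates add up for a monthly bill.
TB = 1024  # S3 bills per GB; 1 TB = 1024 GB

# (tier size in GB, new price per GB); None marks the unbounded top tier.
NEW_TIERS = [
    (1 * TB, 0.140),
    (49 * TB, 0.125),
    (50 * TB, 0.110),
    (400 * TB, 0.110),
    (500 * TB, 0.095),
    (4000 * TB, 0.080),
    (None, 0.055),
]

def monthly_storage_cost(gb_stored, tiers=NEW_TIERS):
    """Charge each slice of the stored data at its tier's rate."""
    cost, remaining = 0.0, gb_stored
    for size, rate in tiers:
        slice_gb = remaining if size is None else min(remaining, size)
        cost += slice_gb * rate
        remaining -= slice_gb
        if remaining <= 0:
            break
    return cost

# Example: 120 TB of standard storage under the new rates.
standard = monthly_storage_cost(120 * TB)
print(f"standard: ${standard:,.2f}/month")
# Reduced Redundancy Storage is one-third cheaper, per the note above.
print(f"RRS (approx.): ${standard * 2 / 3:,.2f}/month")
```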
The full price list can be found on the Amazon S3 page. We'll continue to work relentlessly to drive our costs down so that we can pass the savings along to you!
We've got several more announcements related to S3 coming up in the near future, so stay tuned.
The S3 team is hiring Software Development Engineers, a Technical Program Manager, System Engineers, Administrators, and Product Managers. More information and instructions for applying can be found on the Amazon S3 Jobs page.
Just in time to demo at the RightScale User Conference, we started a private beta of two new cluster monitoring features: stacked graphs and heat maps. We’ve been using them internally for a while, and they’ve been invaluable for quickly determining the cause of issues. The idea behind these types of graphs is not new, but what we’ve been able to do is automate the back-end machinery that lets you create these graphs across any cluster of servers with just a few clicks.
The cluster monitoring we’ve had for a while shows an individual graph per server. Below is an example showing the CPU load on our web front ends. Each graph shows user time in blue and idle time in green over the course of one week. You can see that one of the last two servers has been sitting idle and the other was launched less than a day ago. These individual graphs are nice for a small number of servers, but once your cluster grows beyond about a dozen servers they cease to be practical.
A stacked graph is a great alternative for displaying many servers on one graph when their activity contributes to a total or sum. In the case of the web front ends, each one serves HTTP requests and contributes to the total served by the cluster. This is what the graph below shows: each color band shows the requests/sec for one server, and the bands are stacked on top of one another so that the top edge of the stack reads off the total requests/sec served by the application. This gives a nice overview of what’s happening in aggregate. Also, if one server were serving many more requests than the others, you’d be able to spot it because its band would be significantly wider. However, one thing that is actually not easy to notice in the stacked graph is that two of the servers are not serving requests at all; you’d have to start counting color bands to notice.
A heat map shows a somewhat different view of the same data. In the heat map below, each color bar represents the activity of one server, and the color of the bar at each point in time shows how “hot” the server is, i.e., the value of the variable being displayed, color-coded from blue to red. Here you can again see that the top two servers are outliers, one having been idle the whole time and the other having just launched. On the other hand, it’s pretty difficult to make out absolute values for how busy a server is.
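If you want to play with these two views yourself, here is a minimal sketch using matplotlib and fabricated per-server request rates; it illustrates the graph types only, not our actual RRDtool-based rendering pipeline.

```python
# A sketch of both graph types using matplotlib and fabricated
# per-server request rates; this illustrates the ideas only, not
# RightScale's actual RRDtool-based rendering.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
hours = np.arange(7 * 24)                  # one week of hourly samples

# Rows = servers, columns = time. One server idle all week and one
# launched near the end of the week mirror the outliers described above.
rates = rng.uniform(40.0, 60.0, size=(6, hours.size))
rates[0] = 0.0                             # idle the whole time
rates[1, : hours.size - 20] = 0.0          # launched less than a day ago

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 6))

# Stacked graph: bands sum, so the top edge is the cluster total.
ax1.stackplot(hours, rates)
ax1.set_ylabel("requests/sec (stacked)")

# Heat map: one horizontal bar per server, value encoded as color.
mesh = ax2.pcolormesh(rates, cmap="coolwarm")
ax2.set_ylabel("server index")
ax2.set_xlabel("hours")
fig.colorbar(mesh, ax=ax2, label="requests/sec")

plt.show()
```

Note how the idle and freshly launched servers show up as flat, cool rows in the heat-map panel, while in the stacked panel you’d have to count bands to find them, just as described above.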
The bottom line is that none of the cluster graph types is better than the others; it just depends on what you’re looking for, so it’s nice to be able to flip back and forth. To illustrate this further, here is a real-life example of an issue we encountered. On a Wednesday morning, alerts went off for a small number of our monitoring servers, showing unusually high load on those machines. We first looked at the heat map plotting I/O wait time for the cluster:
You’ll notice that there are a lot of bars! The heat maps are currently limited to showing 100 servers at a time and we have more than 100 monitoring servers, so you’re seeing a sampling that includes the longest running, the shortest running, and some of each different ServerTemplate (a sketch of this selection follows the next paragraph). Also, you’ll notice the color coding alternates between blue-red and green-orange every 10 servers to make it easier to count. The red bars make it pretty clear that something started to affect a small number of servers around 8am. After looking at a number of other variables, we saw the following stacked graph of the number of servers monitored.
Here each color bar represents the number of customer servers monitored by one of our monitoring back-ends. This view displays the activity for a whole week and highlights the fact that a slew of additional servers were monitored right when we got the alerts. From there it was easy for us to pinpoint the cause, which turned out to be a limitation in our monitoring back-end assignment algorithm.
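As an aside, the 100-server sampling mentioned above could be sketched along these lines. The selection rules come straight from the description (longest running, shortest running, some of each ServerTemplate); the data shapes and round-robin order are assumptions for illustration, not our actual code.

```python
# A sketch of the 100-server sampling described above. The selection
# rules come from the text (longest running, shortest running, some of
# each ServerTemplate); the data shapes and round-robin order are
# assumptions, not RightScale's actual code.
from collections import defaultdict
from itertools import cycle

def sample_servers(servers, limit=100):
    """servers: list of dicts with 'id', 'template', and 'uptime' keys."""
    by_uptime = sorted(servers, key=lambda s: s["uptime"])
    picked = {by_uptime[0]["id"], by_uptime[-1]["id"]}  # shortest + longest

    groups = defaultdict(list)
    for server in by_uptime:
        groups[server["template"]].append(server)

    # Round-robin across ServerTemplates until the budget is used up.
    for template in cycle(list(groups)):
        if len(picked) >= min(limit, len(servers)):
            break
        if groups[template]:
            picked.add(groups[template].pop()["id"])
    return picked
```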
I’m really excited about these new monitoring features, in part because I’ve built similar graphs manually several times at different companies, and being able to get them automatically is amazing. You may have noticed that we produce the graphs using RRDtool, which we’ve extended quite a bit, in particular to draw the heat maps. The way the graphs are rendered is that a monitoring front-end queries the data series from the appropriate back-end servers and then assembles everything into one graph, which is sent to the browser. The result is that each of the two graphs above displays 60,000 data points; that’s a lot of data to be able to see at a glance!
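Conceptually, the fetch-and-assemble step looks something like the following sketch. Here fetch_series is a hypothetical placeholder for the call into a monitoring back-end, and the thread-pool fan-out is an illustration of the kind of parallel fetching involved, not our production implementation.

```python
# A sketch of the fetch-and-assemble step. fetch_series() is a
# hypothetical placeholder for the call into a monitoring back-end;
# the thread-pool fan-out is our illustration, not RightScale's
# production code.
from concurrent.futures import ThreadPoolExecutor

def fetch_series(backend_url, server_id, metric, start, end):
    """Placeholder: fetch one server's (timestamp, value) samples from
    the back-end that owns its RRD files, e.g. over HTTP."""
    raise NotImplementedError

def collect_cluster_series(assignments, metric, start, end, max_workers=20):
    """assignments maps server_id -> backend_url. Fetching concurrently
    bounds graph latency by the slowest back-end instead of the sum of
    all fetches."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            server: pool.submit(fetch_series, url, server, metric, start, end)
            for server, url in assignments.items()
        }
        return {server: future.result() for server, future in futures.items()}

# 100 servers x 600 samples each yields the 60,000 points per graph
# mentioned above.
```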
The stacked graphs and heat maps are currently in private beta. If you’re running lots of servers using RightScale and would like to try them out, please drop me an email. One of the tasks still ahead of us before we can release this to everyone is improving the parallelism of the data fetching so we can plot more than 100 servers at a time.
Thanks to the roughly 200 developers who recently came to Yahoo! for our monthly Hadoop User Group meeting. The energy in the packed room was phenomenal, and conversations continued long after the formal sessions ended.
The event started with Chris Riccomini talking about Pig at LinkedIn. It was great to get a firsthand view of how industry is leveraging the power of Pig to solve its data-processing problems. Chris covered how Pig is an integral part of data analytics at LinkedIn and showed how Pig is used to design, develop, and deliver data products there. He explored a successful example of a Pig deployment at LinkedIn, its pain points, and its integration with Azkaban, Voldemort, Hadoop, and the rest of LinkedIn’s ecosystem. Chris also covered the most frequent gotchas and lessons learned, then concluded with some of his thoughts on the evolution of Pig. The talk generated many interesting questions.
Next we had Dhruba Borthakur and Dmytro Molkov from Facebook talking about high availability for the Hadoop NameNode. Although not many current users face NameNode failures, it is certainly the next big hurdle in delivering high availability. Dhruba talked about the specific innovations Facebook has made to keep the NameNode highly available, and discussed the proposed design and its advantages. He described in detail a hot-standby solution called the AvatarNode, then talked about the capabilities of the SecondaryNameNode and the BackupNode. A very insightful presentation.
Finally, we had Ahad Rana from Opencrawl talking about building a scalable Web crawler with Hadoop. Ahad talked about his experience building an open and accessible Web-scale crawl. He discussed the Hadoop data-processing pipeline, including the PageRank implementation, described techniques for optimizing Hadoop, and covered the design of their URL metadata service. He concluded with details on how users can leverage the crawl (using Hadoop) today. The discussion generated very detailed questions, and folks stayed well past the scheduled end of the event to understand the internals of the search index and how they can leverage Opencrawl to solve their business needs.
We at Yahoo! embrace Hadoop, and are looking for exciting technologies and experiences you want to share. Please contact me via the Hadoop Bay Area User Group Meetup page.