I'm pleased to announce that after some reflection, Yahoo! has decided to discontinue "The Yahoo! Distribution of Hadoop" and focus on Apache Hadoop. We plan to remove all references to a Yahoo! distribution from our website (developer.yahoo.com/hadoop), close our GitHub repo (yahoo.github.com/hadoop-common) and focus on working more closely with the Apache community. Our intent is to return to helping Apache produce binary releases of Apache Hadoop that are so bulletproof that Yahoo! and other production Hadoop users can run them unpatched on their clusters.
Until Hadoop 0.20, Yahoo! committers worked as release masters to produce binary Apache Hadoop releases that the entire community used on their clusters. As the community grew, we experimented with using the "Yahoo! Distribution of Hadoop" as the vehicle to share our work. Unfortunately, Apache is no longer the obvious place to go for Hadoop releases. The Yahoo! team wants to return to a world where anyone can download and directly use releases of Hadoop from Apache. We want to contribute to the stabilization and testing of those releases. We also want to share our program of sustaining engineering, which backports minor feature enhancements into new dot releases, so that the world sees improvements coming from Apache every few months, not years.
Recently the Apache Hadoop community has been very turbulent. Over the last few months we have been developing Hadoop enhancements in our internal git repository while doing a complete review of our options. Our commitment to open sourcing our work was never in doubt, but the future of the "Yahoo Distribution of Hadoop" was far from clear. We've concluded that focusing on Apache Hadoop is the way forward. We believe that more focus on communicating our goals to the Apache Hadoop community, and more willingness to compromise on how we get to those goals, will help us get back to making Hadoop even better.
Unfortunately, we now have to sort out how to contribute several person-years' worth of work to Apache so that we can unwind the Yahoo! git repositories. We currently run two lines of Hadoop development: our sustaining program (hadoop-0.20-sustaining) and hadoop-future. Hadoop-0.20-sustaining is the stable version of Hadoop we currently run on Yahoo!'s 40,000 nodes. It contains a series of fixes and enhancements that are all backwards compatible with our "Hadoop 0.20 with security". It is our most stable and highest-performing release of Hadoop ever. We've expended a lot of energy finding and fixing bugs in it this year. We have initiated the process of contributing this work to Apache in the branch hadoop/common/branches/branch-0.20-security. We've proposed calling this the 20.100 release. Once folks have had a chance to try this out and we've had a chance to respond to their feedback, we plan to create 20.100 release candidates and ask the community to vote on making them Apache releases.
Hadoop-future is our new feature branch. We are working on a set of new features that improve Hadoop's availability, scalability and interoperability, making it more usable in mission-critical deployments. You're going to see another burst of email activity from us as we work to get hadoop-future patches socialized, reviewed and checked in. These bulk check-ins are the exception, not the rule; they are the result of us striving to be more transparent. Once we've merged our hadoop-future and hadoop-0.20-sustaining work back into Apache, folks can expect us to return to our regular development cadence. Looking forward, we plan to socialize our roadmaps regularly, actively synchronize our work with other active Hadoop contributors and develop our code collaboratively, directly in Apache.
In summary, our decision to discontinue the "Yahoo! Distribution of Hadoop" is a commitment to working more effectively with the Apache Hadoop community. Our goal is to make Apache Hadoop THE open source platform for big data.
P.S. Here is a draft list of key features in hadoop-future:
- HDFS-1052 - Federation, the ability to support much more storage per Hadoop cluster.
- HADOOP-6728 - The new metrics framework.
- MAPREDUCE-1220 - Optimizations for small jobs.
At midnight on the morning of 1st Feb 2011, IANA announced two more /8s allocated to APNIC (39/8 and 106/8), leaving the final five /8s, one for each of the RIRs.
The policy is that these final five /8s are allocated to the five RIRs immediately, and we expect this to be formally announced at 14:30 on Thursday.
So, no more allocations from IANA after that!
(Well, technically, there is a pool for returned allocations, but that is going to be a tad rare)
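For a sense of scale, some quick arithmetic (mine, not part of the announcement) on what those final five /8s hold:

```python
# Addresses in one /8 block: the remaining 32 - 8 = 24 bits vary freely.
per_slash8 = 2 ** 24
print(f"{per_slash8:,}")      # 16,777,216 addresses per /8
print(f"{5 * per_slash8:,}")  # 83,886,080 across the final five /8s
```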
Sometimes you just need some data to test and stress things. But randomly generated data is awful: it doesn't have realistic distributions, and it isn't easy to tell whether your results are meaningful and correct (see the sketch after the list for a concrete illustration). Real or quasi-real data is best. Whether you're looking for a couple of megabytes or many terabytes, the following sources of data might help you benchmark and test under more realistic conditions.
- The venerable sakila test database: small, fake database of movies.
- The employees test database: small, fake database of employees.
- The Wikipedia page-view statistics database: large, real website traffic data.
- The IMDB database: moderately large, real database of movies.
- The FlightStats database: flight on-time arrival data, easy to import into MySQL.
- The Bureau of Transportation Statistics: airline on-time data, downloadable in customizable ways.
- The airline on-time performance and causes of delays data from data.gov: ditto.
- The statistical review of world energy from British Petroleum: real data about our energy usage.
- The Amazon AWS Public Data Sets: a large variety of data such as the mapping of the Human Genome and the US Census data.
- The Weather Underground weather data: customize and download as CSV files.
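To make the point about distributions concrete, here is a minimal Python sketch (the key counts and the Zipf parameter are illustrative assumptions, not taken from any of the datasets above) comparing a uniform random workload against a skewed, more real-world-like one:

```python
import numpy as np

# Real-world key popularity is usually heavily skewed (roughly Zipf-like),
# so caches and indexes behave very differently than under a uniform load.
# All parameters below are illustrative, not tuned to any real dataset.
rng = np.random.default_rng(42)
n_keys, n_requests = 10_000, 1_000_000

uniform = rng.integers(0, n_keys, size=n_requests)
zipfish = rng.zipf(a=1.3, size=n_requests) % n_keys  # crude skewed sample

for name, sample in (("uniform", uniform), ("zipf-ish", zipfish)):
    counts = np.bincount(sample, minlength=n_keys)
    top1pct = np.sort(counts)[::-1][: n_keys // 100].sum() / n_requests
    print(f"{name}: top 1% of keys get {top1pct:.0%} of requests")
```

Under the skewed sample a tiny fraction of keys dominates the traffic, which is exactly what production systems see and what uniform random data fails to exercise.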
Post your favorites in the comments!
Ideally we want to be supplying *only* IPv6-capable routers by the end of this month. A challenge, I know.
Watch this space...
IPv4 finally hitting the end stops - so what next?
Well, let's look at IPv6 first. It has suffered from chicken-and-egg syndrome for many years - with people not deploying IPv6 services as there are no users, and users not seeing any point as there are no IPv6 services...
Can that work for us now? Can the fact we now have an egg help create chickens faster? I hope so. I hope that momentum will mean IPv6 take-up goes really quickly. I hope in a year's time we are looking back and laughing at it all. I don't know, and I worry that IPv4 will somehow cling on to the bitter end, with NAT and mapping and SRV records and all sorts keeping it going... Let's hope not.
We know the big stumbling block has been IPv6-capable consumer routers. At last it is happening. Give it a few months and you will not be able to buy a DSL router that does not do IPv6. And bear in mind, people do not keep this kit for years - you are lucky when things last longer than a 12-month warranty these days. So in a couple of years all end users will, by simple updates, be using an IPv6-capable router.
The ISPs are not daft either. They know they have to move now, and will. Thankfully, whilst it may take months of planning, it is not that hard. Normal maintenance, a few months' work and some equipment and systems upgrades... Give it 6 months and ISPs will be IPv6-ready - they have to be.
At that point you have consumers everywhere who happen to have IPv6, without planning it or thinking about it - it will just happen naturally. People won't even realise it has happened, and won't realise that www.google.co.uk is now working via IPv6. I will be surprised if this is not within 2 years.
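As an aside, a quick way to check whether a site is already reachable over IPv6 is to ask the resolver for its IPv6 addresses; a minimal Python sketch (the hostname is just an example):

```python
import socket

# Ask the resolver for IPv6 results for a hostname; returns the addresses
# found, or an empty list if the name has no IPv6 presence.
def ipv6_addresses(hostname):
    try:
        infos = socket.getaddrinfo(hostname, 80, socket.AF_INET6)
        return sorted({info[4][0] for info in infos})
    except socket.gaierror:
        return []

print(ipv6_addresses("www.google.co.uk"))
```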
So, that is the rise of IPv6, all looking good and just a question of how long it takes.
What of IPv4? This is where I am really pondering. There are around 3.5 billion IPv4 addresses out there and they are now a limited resource. The last gold mine has been mined dry. What happens to that?
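For the curious, here is rough arithmetic behind a figure in that ballpark. The reserved ranges below are real, but this is a back-of-the-envelope sketch that ignores many smaller reservations:

```python
# Approximate count of publicly usable IPv4 addresses.
total = 2 ** 32                         # 4,294,967,296 possible addresses
multicast = 2 ** 28                     # 224.0.0.0/4
future_use = 2 ** 28                    # 240.0.0.0/4 (reserved for future use)
loopback = 2 ** 24                      # 127.0.0.0/8
private = 2 ** 24 + 2 ** 20 + 2 ** 16   # 10/8, 172.16/12, 192.168/16
usable = total - multicast - future_use - loopback - private
print(f"{usable:,}")                    # 3,723,427,840 - roughly 3.7 billion
```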
A simple and obvious thing that has to happen right away is that the value of IPv4 addresses rises. Until today they were free if you needed them (excepting membership fees). Any higher value was speculative. Now you cannot get them for free any more. They have value. But what does that mean?
Everyone providing any services that consume IPv4 addresses will have to consider the price of that service. It is worth more. It has a higher price than IPv6 usage. It does not matter whether it is broadband, hosting, virtual servers, SSL web sites, whatever. If it uses an IPv4 address it is worth more. You can charge more. You need to charge more, or else you use up what IPv4 addresses you have left.
So the economics change, drastically, and quickly. Simply having IPv4 space is now an asset - is it really worth wasting it on customers? We are not there yet, but how long before ISPs start using NAT even when they do not need to, because they need to maximise a disposable asset? Some people will really need IP space, some could just do with it, and some will have it. This creates a market place. Could there be IPv4 hoarding? What of IPv4 trading for the sake of IPv4 trading, like trading art - people not actually using the space, just holding it as an asset that will gain value? In fact, using it devalues it, as it is harder to stop using it when you sell it...
When IPv4 addresses start trading at £100 each, it may be worth selling a small ISP that has 100,000 of them. No plans to, but what if they get to £1000 each?
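Trivial arithmetic, but worth seeing spelled out (the prices are the hypotheticals above, not market data):

```python
# Value of a 100,000-address holding at the speculated prices (hypothetical).
holding = 100_000
for price in (100, 1_000):  # pounds per address
    print(f"£{price} each -> £{holding * price:,}")
# £100 each -> £10,000,000
# £1000 each -> £100,000,000
```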
Scams... We have not seen what the world of fraud will come up with! I am not sure I can even speculate on that. There will be scams of all sorts. The registries have made changes to make it hard to hijack IP space or sell it when it is not yours, but that won't stop the scams...