As I've blogged about in the past, we love to listen to our customers, and what we hear from them has a direct and very visible influence on our development process. I can literally watch as the information flows from customer meetings to trip reports to press releases (we write those first, before we build the product) and ultimately into products. I've learned that real data from real customers is a very strong persuasive tool.
Just about a year ago, our customers started to ask us for a new product with increased scalability and durability, at the lowest possible price tag. We took a lot of notes and finally figured out what the heck they wanted. We put a crack team on it and they've been working on realizing the vision for over 10 months.
I'm happy to announce that $NAME is now available in a limited beta test form (sign up here). This service is really cool and I think you are really going to like it.
Here are the most important features:
Scalable - $NAME can go from 0 to 10100 and back again within seconds. You can do this manually or with the aid of some CloudWatch Alarms.
Durable - $NAME will survive for at least a day, no matter what.
Global - $NAME is available in all five AWS Regions now.
Cost-Effective - This service is priced far lower than competitive services. It is also made available to you on a pay-as-you-go basis, so you never consume more $NAME than you actually need.
Compatible - $NAME works with your existing tools and applications; no modifications are required.
Secure - $NAME is fully certified and compliant.
As I've noted before, $NAME is designed to work within your existing environment. Here's a block diagram:
Since this is a web service, there's a complete API, with Create, Describe, and Delete calls. We expect a number of third-party tools and toolkits to announce support for $NAME later today.
Update: $NAME is now supported by Shlomo Swidler and Ylastic. Support from CloudBerry Lab is expected to be ready today. Boto includes full support (read their blog post to learn more).
-- Jeff;
Are you deploying apps on the Rackspace Cloud? You’ll want to check out the recorded webinar, “Managing Apps in the Rackspace Cloud”, led by New Relic, a Rackspace Cloud Tools partner. Rackspace has teamed up with New Relic, the leading provider of application management tools for Ruby, PHP, Java and .NET web apps, to make New Relic’s solution available to Rackspace Cloud Servers customers free of charge.
Brian Doll, application performance engineer at New Relic, offers an overview of the New Relic tool’s capabilities and provides examples of real-world use cases. Key highlights you will learn:
• Monitor the performance of deployed live web applications
• Identify and troubleshoot performance issues proactively
• Drill down into individual transactions at the component level
Click below to view the webinar.
Hi fellows,
in one of my last posts I described how to use GIT under Windows.
Now that Zend has released Zend Studio 8 and ZF migrates to GIT with the next major release, it’s time to see how the two can work together. Note that this is not tied to Windows but should work regardless of the OS used.
Prerequisites
What do you need to start?
I assume that you already have a working GITHUB repository as described in the last post.
Then you need an installed version of Zend Studio 8 or newer.
Installing GIT
Now you need to make your Zend Studio aware of GIT. Therefore we need to install EGit. EGit is a module for Eclipse which makes Zend Studio (which is based on Eclipse) understand GIT and provides several tools for working with GIT.
- To install EGit open Zend Studio
- Click “Help”
- Click “Install New Software…”
- Click “Add…”
- Enter “EGit” as “Name”
- Enter “http://download.eclipse.org/egit/updates” as “Location”
- Select all modules by clicking “Select all” (you can omit the two modules that only contain “Source” if you wish)
- Click “Next” until you see the license agreements
- Click “Agree” on all license agreements (of course read the agreements before you agree)
- Finally, click “Finish”
Now EGit will be installed within Zend Studio. As a last step you will need to restart Zend Studio (if it is not done automatically).
Security preparations
If you have used GIT before, you will know that it needs RSA keys. So our next step is to set up keys which will be used within Zend Studio to commit to GITHUB.
- Open Zend Studio
- Click “Window”
- Click “Preferences”
- Expand “General”
- Expand “Network Connections”
- Open “SSH2”
- Go to the tab “Key Management”
Now you can either use an existing key or add a new one.
To load an existing key:
- Click “Load Existing Key…”
- Within the explorer window go to the directory where you stored the existing key for GITHUB and select it
- Enter the “Passphrase” if your key needs a passphrase
- Confirm the “Passphrase” if your key needs a passphrase
- Click “OK”
It is preferred to make a new key instead of loading an existing one. To do so:
- Click “Generate RSA Key…”
- Leave the passphrase empty (this simplifies the later handling)
- Click “Save Private Key…”
- Click “OK” to confirm that you want no passphrase
- Within the explorer window go to the directory where you want to store your keys
- Click “OK”
- Click “OK” once more
- Open GITHUB within your Internet Browser and login to your account
- Click “Account Settings…”
- Click “Public SSH Keys…”
- Click “Add another public key…”
- Enter “Zend Studio” as Title
- Copy the public key which you can see within Zend Studio as key into GitHub
- Click “Add Key…”
Finally, regardless of whether you loaded an existing key or created a new one, click “OK” within Zend Studio to close the preferences window and save the settings. You have successfully prepared Zend Studio to be used with your GitHub account.
Connecting to GitHub
The next step is to connect our GITHUB account to Zend Studio.
- Open Zend Studio
- Click “File”
- Click “Import…”
- Now expand “GIT”
- Select “Projects from GIT”
- Click “Next”
- Click “Clone…”
- Open an Internet Browser and log in to your GITHUB account
- Click your ZF2 repository
- Click “HTTP” (it is near SSH, HTTP…)
- Copy the link which is shown (should look like “https://accountname@github.com….”) into your clipboard
- Copy the link from your clipboard into “URI” within the opened Zend Studio tab
- Enter your GITHUB password within “Password”
- Click “Next”
- Click “Next” once more to add all branches
- Enter the directory into “Directory” where you want to store the local repo (this should be an empty/new directory)
- Click “Finish”
This step will take several minutes as it now copies your remote repository to your local drive.
We’re done. You have successfully connected your GIT repository with Zend Studio.
You can now do your complete work from within Zend Studio. It supports all needed tools. Some of these tasks I will describe now:
Create a new branch
To create a new branch do the following:
- Right-Click your project
- Click “Team”
- And then click “Branch…”
- Click “New Branch…”
- Enter the name of the new branch to “Branch Name…”
- Select “Rebase” (with “None” you will not be able to commit your changes to your GitHub account)
- Click “Finish”
Now you’re working within a newly created branch.
Keep in mind to use the proper naming convention.
“hotfix/ZFxxxx” for quick fixes and “feature/XXXX” for new features. Always use “ZFxxxx” when there is a related issue within Jira.
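If you want to check branch names before pushing, the convention can be expressed as two patterns. This is only a small Python sketch and not part of EGit or Zend Studio; the function name `is_valid_branch_name` is made up, and I am assuming that “ZFxxxx” means “ZF” followed by the issue number and that feature names consist of letters, digits, underscores and hyphens.

```python
import re

# Assumed shapes: "hotfix/ZF" plus an issue number, or "feature/" plus a name.
HOTFIX = re.compile(r"hotfix/ZF\d+")
FEATURE = re.compile(r"feature/[A-Za-z0-9_-]+")


def is_valid_branch_name(name):
    """Return True if the name follows one of the two assumed conventions."""
    return bool(HOTFIX.fullmatch(name) or FEATURE.fullmatch(name))


for candidate in ("hotfix/ZF1234", "feature/MyFeature", "my-branch"):
    print(candidate, is_valid_branch_name(candidate))
```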
Committing changes
Committing your changes is also quite simple:
- Right-Click your project
- Click “Team”
- Click “Commit”
- Enter a note about what you’re committing within “Commit Message…”
I always use notes like this:
[feature/MyFeature] - changed something - another change
- Select the files which you want to commit
- Click “Finish”
Now your change is stored in your local repository. To send your changes to your GITHUB repository you need to do the following:
- Right-Click your project
- Click “Team”
- Click “Push to Upstream”
- Click “OK”
At last your changes are also available within your GITHUB repository.
To have your change integrated within ZF2 you need to make a “Pull-Request” from your GITHUB repository.
The above description does not only work for Zend Framework but can also be used for other projects.
I hope you find this post interesting. I wish you good work and have fun with GIT.
Greetings
Thomas Weidner
I18N Team Leader, Zend Framework
Zend Framework Advisory Board Member
Zend Certified Engineer for Zend Framework
Mockery is my Mock Object Framework for PHP 5.3, geared towards replacing existing mock object solutions (like PHPUnit Mock Objects) with a simpler, more flexible and more intuitive alternative with a fuller feature set. All that aside, this blog post is about some of the features Mockery 0.7 will add shortly. When I set out to write Mockery, I did so knowing that I’d need to settle on some new terminology. Here’s what I came up with:
1. Partial Mocks
This is the least debatable term since it’s adopted by other Test Double frameworks. A partial mock is a mock object where only some methods are mocked. The non-mocked methods continue to behave as defined in the class being mocked. It’s mainly useful where you only want to create a mock object which simulates a concrete implementation of an abstract class (in the absence of an existing concrete class).
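To make the idea concrete without leaning on Mockery’s own API, here is a rough Python analogue using the standard library’s `unittest.mock`. The `Report` class and its methods are invented for illustration; only `fetch` is replaced, while `render` keeps its real behaviour.

```python
from unittest import mock


class Report:
    """Invented example class; fetch() stands in for something expensive."""

    def fetch(self):
        raise RuntimeError("would hit the network")

    def render(self):
        # Not mocked: runs the real code, which calls fetch().
        return "rows: %d" % len(self.fetch())


report = Report()

# Replace only fetch(); every other method keeps its real implementation.
with mock.patch.object(report, "fetch", return_value=[1, 2, 3]):
    rendered = report.render()

print(rendered)
```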
2. Proxy Mocks
In writing Mockery I was faced with a dilemma when it came to classes and methods marked final. You can’t override such classes so the normal mocking implementation just doesn’t work. To resolve this and force the mocking of final methods, I settled on using the Proxy Pattern, thus the term Proxy Mock. The implementation is straightforward. Since we can’t override final methods/classes – we don’t. We ask the user to instantiate an actual object of that type and pass it into Mockery. Mockery then embeds the object into a Proxy class, which can intercept method calls to the real object and replace those methods with alternative behaviours (i.e. mock expectations). This is a simple way to mock final methods. Of course, it comes with a price – the Proxy won’t carry the real object’s type. And people wonder why I really don’t like the final keyword in PHP…
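The mechanics of the Proxy Pattern are easy to sketch. What follows is a minimal Python illustration of the idea, not Mockery’s implementation; `ProxyMock`, `replace_method` and `Payment` are hypothetical names. It also shows the price mentioned above: the proxy is not an instance of the wrapped class.

```python
class ProxyMock:
    """Wraps a real object and lets selected methods be replaced."""

    def __init__(self, target):
        self._target = target
        self._overrides = {}

    def replace_method(self, name, returns):
        # Hypothetical helper: register a canned return value for a method.
        self._overrides[name] = lambda *args, **kwargs: returns

    def __getattr__(self, name):
        # Called only when normal lookup fails, i.e. for the wrapped
        # object's methods. Overrides win; everything else is forwarded.
        if name in self._overrides:
            return self._overrides[name]
        return getattr(self._target, name)


class Payment:
    """Invented stand-in for a class we cannot subclass."""

    def charge(self, amount):
        return "charged %d" % amount

    def currency(self):
        return "EUR"


proxy = ProxyMock(Payment())
proxy.replace_method("charge", returns="mocked")

print(proxy.charge(10))            # intercepted by the proxy
print(proxy.currency())            # forwarded to the real object
print(isinstance(proxy, Payment))  # the proxy does not carry the real type
```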
3. Alias Mocks
An Alias Mock is based on using class_alias() in PHP 5.3. The concept was to allow mocking of static methods by aliasing the class hosting the static methods, the alias being a mock object class. It works best in an autoloading scenario coupled with unit test process isolation. The next term will explain the main reason why aliasing was seized upon as something worth including.
4. Instance Mocks
The real reason why Alias Mocks were created. An instance mock refers to a mock object class created as an alias to a real class. This basically is a cheap method of intercepting the “new” keyword without resorting to PHP extensions. All attempts to instantiate the real class, will instead provide an instance of the mock object alias. Again, it works best with autoloading (require_once calls would create a fatal error).
If you still don’t understand some facets of these – the Mockery README on github.com is a good place to see the API in action for these.
My question to readers is whether these four terms are acceptable? It’s hard to trust any single individual (i.e. me) to dream up appropriate terms, so feel free to offer suggestions or alternatives in the comments.
Last week I announced the launch of the HTTP Archive. The feedback has been very positive. I’ve already heard from a handful of performance gurus who have downloaded the data and done additional analyses. This was a major goal of the project and I’m excited to see it happening.
I made a few changes to HTTP Archive that I wanted to share in this blog post.
First, there’s a potential for apples-to-oranges comparisons because the list of URLs “crawled” by HTTP Archive changes from run to run due to errors and changes in the “top N” sorting of sources like Alexa and Fortune 500. When comparing two runs it’s unclear if differences are caused by a change in the sample set or by actual changes in Internet behavior. This was exemplified by this tweet from @orionlogic:
Website flash usage dropped %16 since 6 months http://t.co/XI7IWvc (via httparchive.org + @souders ) #noflash
The link contains two pie charts from HTTP Archive:
The issue is that the number of URLs grew from ~1000 in October 2010 to ~17,000 in April 2011. Those additional 16,000 websites have different behavior when it comes to using Flash. If we compare Nov 15 2010 to Mar 29 2011, both of which use ~17,000 URLs, the change is only 2%.
I made some changes to mitigate this issue.
- The first three runs that were done with only 1000 URLs are now hidden in the UI. The data is still available.
- Similar confusion can happen when viewing trending charts. The fix there is to use the “intersection” set of URLs across all runs. I added a note next to the “choose URLs” pick list to point out the benefit of choosing “intersection”. I moved the plot of “URLs Analyzed” to the top so it’s more apparent when the number of URLs changes from run to run.
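To see why the “intersection” set matters, here is a toy Python sketch. The data is invented and far smaller than the real runs, and `flash_share` is a hypothetical helper, not code from HTTP Archive. It compares the share of sites using Flash across two runs, first over all URLs in each run and then over only the URLs present in both.

```python
def flash_share(run, urls=None):
    """Fraction of sites in `run` that use Flash, optionally limited to `urls`."""
    selected = run if urls is None else {u: run[u] for u in urls}
    return sum(selected.values()) / len(selected)


# Invented data: URL -> whether the page uses Flash.
october = {"a.com": True, "b.com": True}
april = {"a.com": True, "b.com": False, "c.com": False, "d.com": False}

# URLs crawled in both runs.
common = october.keys() & april.keys()

naive_change = flash_share(october) - flash_share(april)
fair_change = flash_share(october, common) - flash_share(april, common)

print("all URLs:", naive_change)
print("intersection only:", fair_change)
```

With this made-up sample the naive comparison overstates the drop, because the extra URLs in the later run behave differently from the original ones.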
On another note, Nicole Sullivan suggested I add a trending chart for transfer size and number of requests for Flash, in addition to the existing charts for HTML, JavaScript, CSS, and Images. The hypothesis was this might explain the increase in image sizes that I presented in my previous post – perhaps Flash was on the decline being replaced by HD images. The chart shows a slight decline in the size and number of Flash files, but not enough to explain the 61 kB increase in image transfer size.
I made several other fixes that are less visible. Several people have submitted requests for new stats. I’ll keep knocking those off and blogging about them.
More than 200 Hadoop developers and enthusiasts congregated on the Yahoo campus for the monthly HUG meeting on March 16th. As always, they were treated to some enlightening presentations in addition to good food and beverages.
After the usual 30 minutes of socializing and networking, Milind Bhandarkar from LinkedIn kicked off the evening with a really enlightening talk on "Scaling Hadoop Applications." As a well-respected Hadoop expert and a founding member of the Hadoop team at Yahoo in 2005, Milind was able to articulate the issues and solutions very succinctly. His talk was especially interesting because he tied well-known theorems and laws around scalability to the ground realities on the Hadoop clusters today.
Here are the slides from Milind's talk.
Following is the video of the presentation.
This was followed by an interesting talk on "HDFS Federation" by Yahoo's Suresh Srinivas. HDFS Federation is a major feature slated to come out in the near future and Suresh gave the audience an in-depth and in-the-trenches look at this key feature. This also tied into the theme of the day nicely as federation is all about scaling today's Hadoop clusters and making them bigger and faster.
You can find the slide-deck from Suresh's presentation below.
This talk concluded an interesting HUG. Thanks to all the Hadoop users and presenters who attended the March HUG and hope to see you all again soon. If you have an interesting topic related to Hadoop and would like to present, please don't hesitate to contact us.
As mentioned before, I’m building a test and development environment for the Wikimedia Foundation using OpenStack and MediaWiki. I wrote a MediaWiki extension for this project, and have added basic Semantic MediaWiki support to this extension. People have asked me a number of times why I chose to use MediaWiki to build the OpenStack manager, and this post will be an example of why I went this route.
The self documenting architecture
Server documentation is always out of date, and it annoys me. Sure, in a virtualized environment you can query the controller to get information about systems, but that’s only good to a point. Usually most controllers aren’t well suited to do documentation, and it kind of sucks to have to query a system to get that documentation. I like to do system documentation in a wiki. I can organize it how I want, and add any additional information that I want; this may not be supported by a controller. I also want to be able to link to my other documentation from my resource pages, or link from other documentation to my resource pages. This means I usually end up documenting my architecture in a wiki and as normal it’s all out of date.
No more! Since the OpenStackManager extension is managing all of the LDAP and OpenStack Nova resources, it can also add documentation for the resources while it’s at it. The extension will take all of the information and add it to a page based on the resource’s ID. The content of the page will be a mediawiki template (Nova Resource), with arguments and values for each piece of data. Here’s the current format of the template:
{{Nova Resource
|Resource Type=instance
|Instance Name=%s
|Reservation Id=%s
|Instance Id={{PAGENAME}}
|Private IP=%s
|Public IP=%s
|Instance State=%s
|Instance Host=%s
|Instance Type=%s
|RAM Size=%s
|Number of CPUs=%s
|Amount of Storage=%s
|Image Id=%s
|Project=%s
|Availability Zone=%s
|Region=%s
|Security Group=%s
|Launch Time=%s
|FQDN=%s
|Puppet Class=%s
|Puppet Var=%s}}
These pages are created in the Nova Resource namespace, so that it’s possible to restrict write access to that namespace. The pages will be updated whenever certain resources are added, configured, or deleted (currently only instances are supported).
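The %s placeholders suggest plain positional substitution. As a rough illustration of that step, written in Python rather than the extension’s PHP and using a shortened copy of the template, something like the following would produce the page text. The function names and the “Nova Resource:&lt;instance id&gt;” title format are my assumptions here, not the extension’s actual code.

```python
# Shortened copy of the template above; the real one has many more fields.
TEMPLATE = (
    "{{Nova Resource\n"
    "|Resource Type=instance\n"
    "|Instance Name=%s\n"
    "|Instance Id={{PAGENAME}}\n"
    "|Private IP=%s\n"
    "|Instance State=%s}}"
)


def render_resource_page(name, private_ip, state):
    """Fill the positional %s slots in template order."""
    return TEMPLATE % (name, private_ip, state)


def page_title(instance_id):
    """Assumed title format: namespace prefix plus the resource ID."""
    return "Nova Resource:" + instance_id


print(page_title("i-00000010"))
print(render_resource_page("web1", "10.0.0.5", "running"))
```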
An architecture with queryable semantic data
The OpenStackManager extension enables semantic support for the Nova Resource namespace, if Semantic MediaWiki is available. This allows you to add semantic annotations to the Nova Resource template.
By making semantic annotations for all of the resource data, you can then use those annotations in interesting ways. I have some example queries at the reference implementation.
An example use case of this semantic data
Semantic MediaWiki has a bunch of output formats. One really interesting output format is JSON. The first thing that came to mind when I noticed this format was available was: how can I use this on the instances?
I fairly often need to run commands on a number of systems. I use dsh for this, and don’t necessarily like it. I don’t like it because I need to keep the dsh groups updated. This is like documentation. It’s a manual process, and as such, it’s always out of date. Well, since the wiki is documenting the instances as they are created, deleted, and re-configured, then it’s always up to date. Since we have all the instance data semantically annotated, we can also pull that information, and since there is a json export, I can use the json data in scripts on the command line.
As an example, here’s a simple dsh written in python using system groups pulled via semantic queries. First, take a look at the instances we’ll be running this against. Now, let’s take a look at the output:
laner@nova-controller:~$ python ddsh.py -p ganglia "echo hello"
Running "echo hello" on instance "i-00000010.sdtpa.tesla.wmnet"
hello
Running "echo hello" on instance "i-00000011.sdtpa.tesla.wmnet"
hello
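The core of a script like this is small. The sketch below is not ddsh.py itself. It assumes the semantic query has already been fetched as JSON, that -p selects a puppet class, and that the JSON has the made-up layout shown in `SAMPLE` (the real Semantic MediaWiki export is structured differently). It only builds the ssh command lines; running them is left to `subprocess.run`.

```python
import json

# Invented JSON layout, for illustration only.
SAMPLE = """
{"items": [
  {"fqdn": "i-00000010.sdtpa.tesla.wmnet", "puppet_class": ["ganglia"]},
  {"fqdn": "i-00000011.sdtpa.tesla.wmnet", "puppet_class": ["ganglia", "base"]},
  {"fqdn": "i-00000012.sdtpa.tesla.wmnet", "puppet_class": ["base"]}
]}
"""


def hosts_for_class(json_text, puppet_class):
    """Return the FQDNs of instances that carry the given puppet class."""
    items = json.loads(json_text)["items"]
    return [i["fqdn"] for i in items if puppet_class in i.get("puppet_class", [])]


def build_commands(hosts, command):
    """Build one ssh argument list per host, suitable for subprocess.run()."""
    return [["ssh", host, command] for host in hosts]


hosts = hosts_for_class(SAMPLE, "ganglia")
for argv in build_commands(hosts, "echo hello"):
    print('Would run "%s" on instance "%s"' % (argv[2], argv[1]))
```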
Ideas?
This is just a proof of concept of what can be done. I probably won’t actually use this script. I can keep my dsh groups up to date with puppet and likely will. I’m sure I’ll find some really great uses for the semantic data though.
Have any ideas on how to use a system like this? Let me know in the comments!
Related posts:
- OpenStackManager version 1.2 released
- Announcing OpenStackManager extension for MediaWiki
- Building a test and development infrastructure using OpenStack
By 2013, the number of internet-connected mobile devices will exceed the number of internet-connected PCs. Apple’s App Store has paid out more than $2 billion to date to developers of mobile applications, and this is just the beginning. Mobility is the biggest disruption in the industry today and people everywhere are working, playing and learning differently thanks to mobile technologies. Whether at work or on personal time, people have high expectations for a rich and productive mobile user experience. In a recent survey of our Zend Server customers, more than 70% of respondents reported they are either delivering or planning to deliver rich mobile experiences to their users. The need to move quickly and support ever more mobile platforms creates a perfect storm for the emergence of what Gartner calls client-cloud applications*. The client represents a rich application on an internet-connected device while cloud is a set of consumed services hosted in an elastic, scalable cloud platform.
It’s only natural that developers are hungry for a flexible, productive platform that will help them deliver native, connected mobile experiences rapidly. With this in mind, at Zend we have been working closely with our partner Adobe and thanks to that collaboration we are today jointly announcing Adobe Flash Builder 4.5 for PHP (Zend Site, Adobe Site).
This new product appeals to developers who want to build creative and capable mobile apps quickly and efficiently. Flash Builder 4.5 for PHP merges the workflows of mobile client and PHP-driven cloud services, and among other things, enables its users to:
- Build apps that run natively across multiple platforms and devices including iOS, Android, and Blackberry
- Seamlessly create mobile projects that leverage the Flash Platform and PHP
- Leverage wizard-driven workflow to easily wire client-side Flex and server-side PHP
- Enhance the developer experience, with multi-device integrated debugging across desktop (IDE), mobile device (client app) and server (PHP); it’s a game-changer
And in case you didn’t quite notice my previous mention of iOS – yes! Adobe will enable standalone applications built with Flex on the iPhone and iPad**, and with that enable our joint customers to target a broad set of mobile devices.
I am very excited about the possibilities this integration opens up. For a better idea of what this looks like, Kevin Schroeder from Zend has created a very cool flash demo.
As I previously noted, I believe mobile is going to be one of the biggest game changers. Our customers clearly recognize that, and we intend to be there to make them successful. PHP is ideally positioned to deliver the services to native, connected applications due to its high productivity and proven scalability, which enable the rapid delivery of optimal user experiences.
A big thank you to the Adobe team – we have made friendships through this partnership and appreciate the investment they are making in PHP. With our friends at Adobe and our ongoing development of the Zend PHP Cloud Platform, we are providing our users with the tools and development experience they need to master internet-connected mobile application development and seize the opportunities created by the mobile revolution.
The Adobe Flash Builder 4.5 for PHP bits will ship within 30 days. Register on Zend’s or Adobe’s Web site if you want us to let you know when you can download the bits.
Enjoy!
---
* Gartner, March 2011 - Client-Cloud Applications: The Rebirth of Client/Server Architecture
** Adobe compiles ActionScript down to native ARM code. Once it's compiled and packaged, there's no interpreter and the resulting app is fully compliant with Apple's App Store guidelines. iOS support in Flash Builder 4.5 will ship 30 days after the product is released and will be a free update.
There are a few things many PHP developers should be familiar with. We should be familiar with PEAR packages. We should be familiar with the PEAR installer. More and more of us actually are getting familiar with running PEAR channels. The problem that some of us have, like me, is that we’re working against an architecture which focuses on the central PEAR repository. It’s the elephant in the room, so to speak, and we spend year after year working around or completely ignoring it.
While I do criticise PEAR through this statement, that would be missing the point. PEAR is a story of two parts – the distribution mechanism using the PEAR installer and PEAR channels, and the centralised package repository. The problem is splitting the two, and I don’t believe the PEAR group (along with the rest of us) have really considered that which is a shame. As Till Klampaeckel recently stated, “PEAR packages are not as easy to use as some code you copy-pasted off the Zend devzone or phpclasses. While I agree, that we should try to make it just as easy, it’s just not one [of] PEAR’s goals right now.”.
That kind of sums up my issues with PEAR as a distribution mechanism – it’s not as easy as it could be and it really ought to be a primary goal. It’s not… PEAR obviously treats its own package repository as an elite citizen. You don’t need to discover its channel, or use a channel prefix, or go consulting Google to find something useful (it has a neat categorised package listing). No other PEAR channel set up independently has those advantages and the extra steps can’t compete with a simple download from a website. To install from another channel, you need to go search Google for a suitable library, check if it even has a PEAR channel, find the channel URI, guess the channel prefix – and then finally install something. Since channel discovery is probably not automatically performed (by default), dependency resolution can lead to more pain. This assumes hosting a PEAR channel is easy (which it is given the recent explosion of them using sane channel hosting tools like Pirum). Maybe that’s our trigger – PEAR channels are easy now. Suddenly, all the libraries I use are, or will be, hosted on a PEAR channel. Even Zend Framework 2.0 is heading to PEAR split by component. It’s fantastic!
Since we seem to like blaming the PEAR Group, and getting that ball kicked back to us, it’s time we did something useful. We’ve spent too much time ignoring PEAR as we grew apart from it with our frameworks, standalone libraries and custom plugin architectures. We’re making life harder for ourselves in doing so. Stuart Herbert has posted a short article to gather requirements for a Pear Channel Aggregator. I strongly suggest that interested PHP programmers drop by and add a comment with some suggestions/feedback. Let’s get this thing moving forward!
Gathering Requirements For A PEAR Channel Aggregator
Once that article went up, we started seeing that people are out there working on the problem. I don’t see any of them as being a final solution, and God forbid we just adopt a half-measure for some limited subset of requirements. It would be far preferable, if not essential, to see a complete solid solution meeting everyone’s varied requirements. There’s no excuse not to. Either we want a robust complete solution or we don’t.
As to the concept of a channel aggregator, I view it as a shift away from PEAR’s focus on a centralised repository to a focus on supporting and enabling an open decentralised system with one (or more) focal points. PEAR’s own channel should just be one among many. A channel aggregator should do away with channel prefixes, channel discovery, compulsory reliance on PEAR packaging/channel hosting (git is too popular to ignore), and support combining the package details from potentially hundreds of channels/git repos into one or more competing aggregators that offer easier lookup, rankings, user feedback, etc (baby steps obviously before we go nuts). Installing any package from any channel or supported git repo should only require hitting up a channel aggregator with its name – no other mucking around. It would become an ecosystem that is impossible to ignore and obvious to utilise for hosting and distribution, not simply of libraries but of components, plugins and anything that can be considered installable. I’m also being careful not to overly point at one aggregator – any decent decentralised system should be capable of supporting multiple nodes with indifference, working in tandem or not.
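To make “hitting up a channel aggregator with its name” a little less abstract, here is a deliberately tiny Python sketch of the lookup side. It is an illustration of the idea under my own assumptions, not a proposal for the real design; the `Aggregator` class, the example.org channels and the version numbers are all invented.

```python
class Aggregator:
    """Combines package listings from many channels into one index."""

    def __init__(self):
        self._index = {}  # package name -> list of (channel, version)

    def add_channel(self, channel, packages):
        """Register a channel and the packages (name -> version) it hosts."""
        for name, version in packages.items():
            self._index.setdefault(name, []).append((channel, version))

    def lookup(self, name):
        """Find every channel offering `name`; no prefix or discovery needed."""
        return list(self._index.get(name, []))


aggregator = Aggregator()
aggregator.add_channel("pear.example.org", {"Mockery": "0.7.0"})
aggregator.add_channel("git.example.org", {"Mockery": "0.7.1", "Other": "1.0.0"})

print(aggregator.lookup("Mockery"))
```

Rankings, user feedback and support for multiple competing aggregators would sit on top of an index like this.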
The other side of the problem is PEAR channel hosting and packaging. Hosting is easy. Set up Pirum and you’re ready to go. Packaging needs a better tool: a script that can accept a simple configuration file, build the package, and optionally include a cryptographic signature (because who are we kidding, we need the system to be secure). While building a robust system, overly relying on traditional PEAR distribution methods while the PEAR community has been slaving away on PEAR2 and Pyrus would be a tragedy. Do it right, or not at all. Ignoring all that work, if it proves useful, makes no sense. At the same time, allowing them to create limitations in the requirements would be a serious error.
Regardless, we really need the PEAR Group’s participation because, frankly, their input and support will make the difference between evolving PEAR distribution (with or without the existing PEAR efforts) with us to meet our needs, or sundering the community into two approaches that probably won’t be compatible.
I would hope that we’re capable of improving the entire infrastructure – not just one piece of it. Hopefully, with the assistance of our PEAR colleagues. And hopefully by avoiding segmentation of the overall community. If you’re working on pieces of the puzzle, I appeal to you to band together. Let’s not let fiefdoms derail what could be a massive boost to publishing code (not just libraries) online.
Gathering Requirements For A PEAR Channel Aggregator
In case you missed it the first time!
I have never before spent so much time preparing a keynote, and it looks like my attempts to present something a bit different were appreciated. You can find a copy of the keynote and video of it at the MySQL conference web site.
I would, however, like to offer a correction. A couple of days before the conference I had been told by someone (someone that I had every reason to believe was a reliable source) that there were only 54 of the original 400-450 MySQL Ab people left at Oracle. I asked a lot of former Oracle employees if this figure could be accurate and everyone told me that the figure sounded low, but they could believe it. I could not find anyone in Oracle willing to comment upon it before the keynote. Now I have finally been able to verify this, and there are still closer to 200 original MySQL Ab people left at Oracle working with MySQL.
My statement that it's only the InnoDB, NDB and most of the replication team that are intact and that there are only 2 original core MySQL developers left is however accurate. My apologies for the wrong initial number of total people left, but I acted in good faith regarding it!
To make things clear, I am not in any way trying to downplay the hard work the MySQL developers (and other MySQL people) are doing. They are doing an amazing job, with the resources they have at their disposal.
My point was that I am worried that the MySQL developer and support ecosystem is slowly falling apart because people are leaving Oracle and going to other companies where they are no longer participating in MySQL development. That is why companies like Monty Program Ab and SkySQL are important, as we help keep the ecosystem together by having people working on the same thing they did before!
Apart from this, it has been a great conference and it's been a true pleasure to meet all of the 'old MySQL conference gang' yet another time!
As an extremely happy user of crash-safe-slave functionality since 4.0, I hereby welcome this feature in the upcoming 5.6 release!
5.6 seems to be the strongest production-support release since the introduction of InnoDB, solving issues of long-running high-performance systems that were forced to use Percona/M@FB/GooglePatch/.. before. Good!
The MySQL team has been very productive recently with many bug fixes, many features implemented, a great 5.5 release and what appears to be an even better 5.6 release. The changes in 5.6 are a really big deal. I can't wait to stop porting rpl_transaction_enabled to get crash-proof slave state.
The value-added community continues to push the state of the art for those who can't wait for the great new features to be GA or for those with special problems. I am interested in trying parallel replication apply from Tungsten, working with Monty Program to improve monitoring in MariaDB and working with Percona to improve InnoDB quality-of-service for high-throughput OLTP.
I have begun to catch up on my reading to figure out what has changed for 5.6. I probably need another week to finish reading the many useful blogs and presentations. DimitriK published a 5-part performance report that I have yet to start (1, 2, 3, 4, 5). I am happy to find that a few more MySQL developers at Oracle have begun blogging (Oystein, Didrik, Luis, Olav). How do I type the accented "O" in Oystein?
First the InnoDB changes:
- Information schema system tables - InnoDB has moved a lot of data to IS tables. The ones listed here are not the most interesting to me but the migration in general makes life easier for me. I don't know if these are new in 5.6, but they are more interesting to me than the ones described in the InnoDB blog: INNODB_BUFFER_PAGE, INNODB_BUFFER_PAGE_LRU, and INNODB_BUFFER_POOL_STATS.
- No more kernel_mutex - Wow! This is a huge deal for multi-core performance and will enable even more improvements in future releases. As part of the admission control feature we have been trying to reduce kernel_mutex contention and noticed that 5.5 already was much better for that.
- Persistent index cardinality statistics - When a server is restarted, all index cardinality stats are recomputed. In MySQL 5.1 without the Facebook patch, this is serialized because of LOCK_open, which can create too many stalls on restart. I assume this allows a DBA to populate the stats table manually, which can be very useful for a deployment with many scale-out slaves to prevent query plans from changing between slaves. Other details are at Oystein's blog.
- Multi-threaded purge - I want this right now. This and parallel replication apply make it possible to deploy large databases on slaves using disk setups that match what is available on a master. Otherwise you must read Yoshinori's slides to find the workaround.
- Data dictionary LRU helps if you have a lot (or too many) tables.
- Page cleaner thread - This is a good step but I wonder if it is enough and need to read code or wait for a Percona performance report. Has anything been done to prevent stalls when the async flush tries to flush too many pages in one call? The problem is that all dirty pages from an extent are flushed when at least one page with a too-old LSN is in the extent. Flushing neighbor pages can increase the number of pages flushed by up to 64X. I have seen benchmark servers stall for 60+ seconds while 200,000+ pages were flushed when the async limit was reached. This can be very painful in servers that are able to cache the database and fill with many dirty pages.
- memcached API for InnoDB - this should enable HandlerSocket-like performance while using an API that is already supported by most clients (PHP, Java, Python, ...). I think that non-SQL interfaces to InnoDB will be a big deal.
- metrics table for InnoDB counters - I spent too much time adding counters to SHOW STATUS for InnoDB. Now you don't have to.
- parallel replication apply - I am confused. The labs page and worklog state that this is for RBR. I want parallel apply for SBR to overcome slave replication lag from IO bound slaves. Parallel apply provides parallel IO requests. Per the blog by Luis, I think this supports SBR.
- system tables for slave state - I don't know if WL2775 describes what was implemented. Feedback from gmaxia makes me wish for more docs. Hopefully I can stop porting rpl_transaction_enabled and begin to use this instead. I need to read the blog by Mats.
- replication checksums - I want binlog event checksums. I have not needed them for a long time since a certain bug was fixed but I would rather not worry about that problem again. Mats also wrote about this.
- informational log events - I need the original SQL including its query comment in the binlog if I am to use RBR. With this feature I have one less excuse for not trying RBR.
- remote binlog backup - We already have this in the 5.1 Facebook patch thanks to a backport by Harrison. It lets you archive the binlog almost as soon as it is written.
- universal group identifiers - This might be the equivalent of global transaction IDs from the Google patch.
- optimized RBR logging - This will be a big deal for tables with BLOB columns that get frequent updates to non-BLOB columns. I have a few of those.
- Time delayed replication - I don't need this but many others will.
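Returning to the page cleaner item in the InnoDB list above, the 64X figure can be illustrated with a toy model. This is a sketch in Python, not InnoDB's code: it assumes an extent of 64 pages (consistent with the "up to 64X" number), and the names `pages_flushed`, `dirty_lsn_by_page` and `lsn_limit` are made up for the example.

```python
PAGES_PER_EXTENT = 64  # assumption: 64 pages per extent, matching the "up to 64X" figure


def pages_flushed(dirty_lsn_by_page, lsn_limit, flush_neighbors=True):
    """Return the sorted page numbers written by one flush call in this toy model.

    dirty_lsn_by_page maps page number -> oldest modification LSN of that dirty page.
    A page "must" be flushed when its LSN is older than lsn_limit.
    """
    too_old = {p for p, lsn in dirty_lsn_by_page.items() if lsn < lsn_limit}
    if not flush_neighbors:
        return sorted(too_old)
    # With neighbor flushing, every dirty page in an affected extent is written too.
    extents = {p // PAGES_PER_EXTENT for p in too_old}
    return sorted(p for p in dirty_lsn_by_page if p // PAGES_PER_EXTENT in extents)


# Worst case: every page in one extent is dirty, but only one has a too-old LSN.
dirty = {page: 1000 for page in range(PAGES_PER_EXTENT)}
dirty[0] = 10
must_flush = pages_flushed(dirty, lsn_limit=100, flush_neighbors=False)
actually_flushed = pages_flushed(dirty, lsn_limit=100)
amplification = len(actually_flushed) // len(must_flush)
```

In this worst case one too-old page drags the other 63 dirty pages of its extent along with it, which is how a modest backlog can turn into a very large flush.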
There’s a long list of interesting stats to be added to the HTTP Archive. I’m planning on knocking those off at about one a week. (If someone wants to help that’d be great – contact me. Familiarity with MySQL and Google Charts API is a plus.)
Last week I added an interesting stat looking at the cache lifetime being specified for resources – specifically the value set in the Cache-Control: max-age response header. As a reminder, the HTTP Archive is currently analyzing the top ~17K websites worldwide. Across those websites a total of ~1.4M resources are requested. The chart below shows the distribution of max-age values across all those resources.
56% of the resources don’t have a max-age value and 3% have a zero or negative value. That means only 41% of resources are cacheable. In more concrete terms, the average number of resources downloaded per page is 81. 33 of those are cacheable, but the other 48 will likely generate an HTTP request on every page view. Ouch! That’s going to slow things down. Only 24% of resources are cacheable for more than a day. Adding caching headers is an obvious performance win that needs wider adoption.
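To make the arithmetic concrete, here is a small Python sketch. The function names (`max_age_seconds`, `is_cacheable`) are illustrative and not taken from the HTTP Archive code. It treats a resource as cacheable only when Cache-Control carries a positive max-age, which mirrors the simplification used above and ignores Expires and other caching headers.

```python
import re


def max_age_seconds(cache_control):
    """Return the max-age value from a Cache-Control header, or None if absent."""
    if not cache_control:
        return None
    match = re.search(r"max-age\s*=\s*(-?\d+)", cache_control, re.IGNORECASE)
    return int(match.group(1)) if match else None


def is_cacheable(cache_control):
    """Cacheable here means: max-age is present and greater than zero."""
    age = max_age_seconds(cache_control)
    return age is not None and age > 0


# Percentages quoted in the text above.
no_max_age_pct = 56
zero_or_negative_pct = 3
cacheable_pct = 100 - no_max_age_pct - zero_or_negative_pct

# Average resources per page, split by the cacheable share.
resources_per_page = 81
cacheable_per_page = round(resources_per_page * cacheable_pct / 100)
uncacheable_per_page = resources_per_page - cacheable_per_page
```

The split works out to the 33 cacheable and 48 uncacheable resources per page quoted above.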
As my friends know I’m a big fan of Ignite presentations, and the ones from MySQLConf 2011 last week were just outstanding. If you didn’t have the chance to see them live, here are the links to the best of them.
This was by far my favorite: “Scale Fail” by Josh Berkus, an extremely funny talk on how to build sites that don’t scale. Highly recommended.
The other two were also great Ignite talks, and definitely worth the time to check out: “Causes of Downtime in MySQL” by Baron Schwartz and “The Art of Data Visualization” by David Holoboff.
Last month the HCatalog project (formerly known as Howl) was accepted into the Apache Incubator. We have already branched for a 0.1 release, which we hope to push in the next few weeks. Given all this activity, I thought it would be a good time to write a post on the motivation behind HCatalog, what features it will provide, and who is working on it.
Why Did We Create HCatalog?
Out of the box Hadoop provides the HDFS file system for users to store their data. File systems are nice because they provide a simple interface. Users can easily copy data into the file system and run jobs against that data. However, for more complex data processing tasks, the file system abstraction is not rich enough. It forces users to know where data is located, what format it is stored in, how it is compressed, and what its schema is. Consider, for example, a Pig Latin script used to do ETL on raw web logs:
A = load '/data/raw/ds=20110225/region=us/property=news' using PigStorage()
as (user:chararray, url:chararray, timestamp:long);
...
If the grid administrator needs to move this data to a new location, or compact multiple files into an archive (such as har), or the data producer changes the schema or starts writing it compressed, this Pig Latin script will need to be changed. Pig and MapReduce job specifications are tightly coupled with the data storage. This inhibits data producers' ability to improve their processes. They and their users are forced to go through a painful transition process to take advantage of any improvements. This lack of data storage independence is exacerbated by the fact that in most large Hadoop installations there will be users using different tools to access their data. So a change in storage format will ripple through multiple groups and require coordination across disparate data processing tools.
The fact that in many Hadoop installations different users will be using different tools brings up another issue. These tools do not share a common notion of schemas or data types. Pig and Hive have different, though similar, data models. MapReduce leaves data types to its users. This can make sharing a data set across users with different tools difficult and error prone.
When data consumers are building new applications or want to query new data sets, finding the data in a file system is difficult. Generally, a user has to contact the group that produces the data and ask them where they write their data, what format it is in, and what its schema is. Some organizations use wiki pages or similar mechanisms to record this information. But such pages inevitably get stale as changes to the data storage are made.
Even once users know where data is, file systems do not help them know when it is available. In a complex data processing environment there will be many users waiting for the availability of a given data set. This is particularly true of the foundational data sets that form the basis for much of an organization's data processing. With a file system, the only way to know if data is available is to do an 'ls'. Hundreds of users banging away with 'ls' commands every few seconds does not help HDFS' performance.
Finally, a file system tends to become a dumping ground with little knowledge of how the data there should be managed. How long should data be kept? Are there legal reasons to store it on tape before deleting it? If so, how do you know what has been archived and what has not? With a file system all of these issues tend to be addressed by conventions. "Data in this directory is cleaned up after 30 days." "After data is archived a '.archive' file is placed in its directory." Often each data creator has to answer these questions, and they tend to answer them differently. Thus an organization ends up with not one data management policy, but twenty, thirty, or a hundred policies.
How HCatalog Addresses These Issues
Hadoop needs a better abstraction for data storage, and it needs a metadata service. HCatalog addresses both of these issues. It presents users with a table abstraction. This frees them from knowing where or how their data is stored. It allows data producers to change how they write data while still supporting existing data in the old format so that data consumers do not have to change their processes. It provides a shared schema and data model for Pig, Hive, and MapReduce. It will enable notifications of data availability. And it will provide a place to store state information about the data so that data cleaning and archiving tools can know which data sets are eligible for their services.
HCatalog takes Hive's metastore, and wraps additional layers around it to provide these services. It comes with HCatInputFormat and HCatOutputFormat for MapReduce users, and HCatLoader and HCatStorer for Pig users. Taking the Pig script example above, using HCatalog it looks like:
A = load 'raw' using HCatLoader();
B = filter A by ds == '20110225' and region == 'us' and property == 'news';
...
Notice that Pig no longer knows or cares about the file location or storage format. It is just loading a table named 'raw'. If an administrator decides to relocate the files that store raw, or switch to a better storage format, this Pig Latin script does not change at all. Also notice that the user no longer has to declare the schema. HCatLoader communicates that automatically to Pig. HCatalog uses Hive's data model. When loading data into Pig, the types are translated to Pig types. For example, a Hive struct becomes a Pig tuple. When used by MapReduce, the value that HCatInputFormat returns is an HCatRecord, which is an ordered list of typed data. At the same time, this does not force Pig or MapReduce to scan the whole table. Pig knows how to push the filter statement into HCatLoader so that only the appropriate files are read.
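The effect of pushing the filter down can be sketched as follows. This is a Python illustration of partition pruning in general, not HCatalog's implementation: the `PARTITIONS` mapping, the `prune` function, and every path other than the one from the earlier script are hypothetical.

```python
# Hypothetical catalog: partition key values -> file location.
PARTITION_KEYS = ("ds", "region", "property")
PARTITIONS = {
    ("20110225", "us", "news"): "/data/raw/ds=20110225/region=us/property=news",
    ("20110225", "us", "sports"): "/data/raw/ds=20110225/region=us/property=sports",
    ("20110226", "us", "news"): "/data/raw/ds=20110226/region=us/property=news",
}


def prune(partitions, **wanted):
    """Return locations of partitions whose key values match every wanted value."""
    selected = []
    for values, location in partitions.items():
        row = dict(zip(PARTITION_KEYS, values))
        if all(row[key] == value for key, value in wanted.items()):
            selected.append(location)
    return sorted(selected)


# Equivalent of the filter on ds, region and property in the script above.
files_to_read = prune(PARTITIONS, ds="20110225", region="us", property="news")
```

Only the one matching location is handed to the reader; the other partitions are never opened, and the script itself never names a path.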
HCatalog includes Hive's command line interface so that administrators can create and drop tables, specify table parameters, etc. It will also allow users to explore what tables are available and what their schemas are.
HCatalog also provides an API for storage format developers to tell HCatalog how to read and write data stored in different formats. Currently HCatalog knows how to read and write RCFiles. But if data is stored in a different format, a user can implement an HCatInputStorageDriver and HCatOutputStorageDriver to tell HCatalog how to translate between their data storage and the record format HCatalog uses. Which StorageDriver to use is stored at the partition level. So if you need to change how your table is stored when you already have a year's data in it, there is no need to reprocess the data. New data can be written in the new format while the old data stays in the old format. HCatalog handles using the correct StorageDriver to read each partition, allowing users to read across partitions without ever knowing they are in different formats.
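The idea of recording the driver per partition can be sketched in a few lines of Python. This is only an illustration of the dispatch pattern: `TextDriver`, `JsonDriver`, `DRIVERS` and `read_table` are hypothetical names, not the HCatInputStorageDriver API, and the sample data is made up.

```python
import json


class TextDriver:
    """Hypothetical driver for tab-separated text."""

    def read(self, raw):
        return [line.split("\t") for line in raw.splitlines()]


class JsonDriver:
    """Hypothetical driver for one JSON array per line."""

    def read(self, raw):
        return [json.loads(line) for line in raw.splitlines()]


DRIVERS = {"text": TextDriver(), "json": JsonDriver()}

# Each partition records which driver wrote it, so old data is never rewritten.
table = [
    {"ds": "20110224", "driver": "text", "raw": "alice\t/home\nbob\t/news"},
    {"ds": "20110225", "driver": "json", "raw": '["carol", "/sports"]'},
]


def read_table(partitions):
    """Read every partition with its own driver and return uniform records."""
    records = []
    for partition in partitions:
        driver = DRIVERS[partition["driver"]]
        records.extend(driver.read(partition["raw"]))
    return records


records = read_table(table)
```

The caller sees one list of records in a single shape, even though the two partitions were stored in different formats.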
HCatalog Is Brought To You By...
HCatalog is a collaborative effort between members from the Apache Pig, Hive, and Hadoop projects, plus new contributors. Most of the new code has been written by Yahoos, while the Hive team has been very helpful in providing design feedback and pushing HCatalog's changes into the Hive metastore.
Where Can You Learn More?
You can learn more about HCatalog, download the source, file JIRAs, and join the mailing lists on HCatalog's website. You can also come join us at the Hadoop Summit.