When deploying a pure-flash server I want a database engine that optimizes for compression as more compression means less flash must be purchased. The engine can do extra IOPS in search of more compression. Column-wise storage is an example of a feature that can improve compression at the cost of extra disk reads.
When deploying a pure-disk server I want a database engine that has optimizations to reduce disk IO. The InnoDB insert buffer reduces disk reads done for secondary index maintenance. TokuDB and LSM trees eliminate random disk writes.
Is there one database engine that optimizes for compression and saves IOPS? Is this an example where one size does not fit all? A log-structured engine can save IOPS by doing fewer random writes. But it will also use more disk space because old versions of data are not compacted immediately. An update-in-place engine might get less compression than possible because writes are frequent and compression must be fast. Using InnoDB as an example, it gets less than half of the compression rate that is feasible for data that I like. Support for prefix compression and larger pages (32kb or 64kb) will improve this, but InnoDB will always get less compression than possible because compressed pages are rounded up to 2kb, 4kb or 8kb with key_block_size. I think that is a problem any time block compression is used for an update-in-place engine.
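To make the rounding cost concrete, here is a rough sketch of the arithmetic. It uses a simplified model in which every compressed page occupies one full block of the configured size. The numbers (a 16kb page that compresses to 5kb, stored in 8kb blocks) and the wastedFraction() helper are made up for illustration and are not taken from InnoDB itself.

```php
<?php
// Simplified model (an assumption, not InnoDB's actual algorithm): every
// compressed page occupies one full on-disk block of $blockSizeKb.

/**
 * Fraction of a block that is padding when a page that compresses to
 * $compressedKb is stored in a block of $blockSizeKb.
 */
function wastedFraction(float $compressedKb, int $blockSizeKb): float
{
    if ($compressedKb > $blockSizeKb) {
        throw new InvalidArgumentException('page does not fit in block');
    }
    return ($blockSizeKb - $compressedKb) / $blockSizeKb;
}

// A 16kb page that compresses to 5kb still occupies a whole 8kb block.
$effectiveRatio = 16 / 8; // what is actually paid for on disk
$possibleRatio  = 16 / 5; // what the compressor achieved

printf(
    "effective %.1fx, possible %.1fx, wasted %.1f%%\n",
    $effectiveRatio,
    $possibleRatio,
    100 * wastedFraction(5, 8)
);
```

In this made-up case more than a third of each block is padding, which is the gap between the compression that is feasible and the compression that is actually stored.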
Following on from the discussion on modules, we can hook into the event system to do module specific bootstrapping. By this, I mean, if you have some code that you want to run only if the action to be called is within this module, you can hook into the Application's dispatch event to achieve this.
As you already know, your module's init() method enables you to create a hook to the bootstrap event:
module/Simple/Module.php
<?php
namespace Simple;

use Zend\Module\Manager,
    Zend\EventManager\Event,
    Zend\EventManager\StaticEventManager,
    Zend\Module\Consumer\AutoloaderProvider;

class Module implements AutoloaderProvider
{
    public function init(Manager $moduleManager)
    {
        $events = StaticEventManager::getInstance();
        $events->attach('bootstrap', 'bootstrap', array($this, 'onBootstrap'));
    }

    // Assume that getAutoloaderConfig() & getConfig() are defined here

    public function onBootstrap(Event $e)
    {
        $application = $e->getParam('application');
        $config      = $e->getParam('config');
    }
}
The $application has an Event Manager which you can then use to hook into various stages of the routing and dispatch process. So we can create a callback for the dispatch event which occurs just before the controller action is called:
public function onBootstrap(Event $e)
{
    $app = $e->getParam('application');
    $app->events()->attach('dispatch', array($this, 'onDispatch'), -100);
}
This code will call our module's onDispatch() method when the dispatch event is fired.
Within our onDispatch(), we do:
public function onDispatch($e)
{
    $matches    = $e->getRouteMatch();
    $controller = $matches->getParam('controller');
    if (strpos($controller, __NAMESPACE__) !== 0) {
        // not a controller from this module
        return;
    }

    // Do module specific bootstrapping here
}
This code is called after routing has occurred so we can retrieve the RouteMatch object which contains information about the controller and action that is about to be called. We test if the namespace of the controller matches the namespace of the module and if it does not, we return.
If it does, then we are about to dispatch to a controller action within our module and so we can run any module-specific code that we wish to.
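As a rough illustration of the namespace test in isolation, here is a small standalone sketch. The belongsToModule() helper is hypothetical and not part of the framework, and it assumes that the controller parameter holds a fully qualified class name. Unlike the bare strpos() check, it appends the namespace separator so that a module called SimpleBlog is not mistaken for Simple:

```php
<?php
/**
 * Hypothetical helper: true when $controller lives in $moduleNamespace.
 * Appending the separator avoids matching "SimpleBlog\..." for "Simple".
 */
function belongsToModule(string $controller, string $moduleNamespace): bool
{
    return strpos($controller, $moduleNamespace . '\\') === 0;
}

var_dump(belongsToModule('Simple\Controller\IndexController', 'Simple'));      // bool(true)
var_dump(belongsToModule('Application\Controller\IndexController', 'Simple')); // bool(false)
var_dump(belongsToModule('SimpleBlog\Controller\PostController', 'Simple'));   // bool(false)
```

Whether the stricter match is needed depends on how your modules are named; if no two module namespaces share a prefix, the simpler check behaves the same.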
As an example we could, for instance, change the layout for all actions within this module by accessing the layout view model:
public function onDispatch($e)
{
    $matches    = $e->getRouteMatch();
    $controller = $matches->getParam('controller');
    if (strpos($controller, __NAMESPACE__) !== 0) {
        // not a controller from this module
        return;
    }

    // Do module specific bootstrapping here

    // Set the layout template for every action in this module
    $viewModel = $e->getViewModel();
    $viewModel->setTemplate('layout/simple');
}
In this case, I have set the layout to layout/simple rather than the default of layout/layout. Of course, you could also set variables into the ViewModel for use in your layout script too.
I've just finished presenting the results of our Drupal and Devops survey at the Belgian Drupal User Group meetup at our office, and I've uploaded the slides to SlideShare for the rest of the world to read (and cry over).
Honestly I was hoping for the audience to prove me wrong and I was expecting all of them to claim they were doing automated and repeatable deployments.
But there's hope...
If your knowledge of constructors ends with “the place where I put my object initialization code,” read on. While this is mostly what a constructor is, the way a developer crafts their class constructor greatly impacts the initial API of a particular class/object, which ultimately affects usability and extensibility. After all, the constructor is the first impression a particular class can make.
Constructors, in their current form, have been in PHP since 5.0.0. Previous to 5.0, PHP loosely followed the style similar to that of C++ where the name of the method matching the name of the class would act as the class constructor. PHP 5 brought us the __construct() “magic method” which greatly formalized the new object initialization routine.
Before jumping into some of the topics covered in this post, there are a few things you might want to be familiar with. First, be familiar with the SOLID principles, particularly the S (single responsibility principle), the L (Liskov substitution principle, commonly referred to as the LSP), and the D (dependency inversion principle). More to the point of the latter, review a previous post on Dependency Injection in PHP for background on dependency injection specific to PHP.
The Constructor Signature
In PHP, you create a constructor by adding a method called __construct() to your class. The __construct() method is an instance method and as such, is not marked static. For all intents and purposes, consider the __construct() magic method as a special type of static object factory, one which will always return the type of the object requested via the new keyword.
class Foo {
    public function __construct() {
    }
}

$object = new Foo();
In the above code, PHP will, upon executing new Foo(), internally create a new object from scratch, execute __construct() in the Foo class, and assign this object to the variable $object. Pretty standard stuff. What’s important to know here is that before new Foo(), the object did not exist. It is this fact alone that makes this completely different from any other kind of instance method. That said, without getting into the gritty details, it is this fact alone that excuses the __construct() method from the same rules of the LSP that might apply to other instance methods.
This means that all of the following are legal:
class Foo {
    public function __construct() {
    }
}

class Bar extends Foo {
    public function __construct(ArrayObject $arrayObj, $number = 0) {
        /* do stuff with $arrayObj and $number */
    }
}

class Baz extends Bar {
    public function __construct(Bar $bar) {
        // yes, this is the proxy pattern
    }
}
The above, with E_STRICT enabled, will not produce a warning. Yet, if you renamed all of the __construct methods to anything else, they would produce an E_STRICT warning like:
Strict standards: Declaration of Bar::somemethod() should be compatible with that of Foo::somemethod()
Why is this the case? Simply put, the LSP refers to subtypes of a particular object, and before the __construct() method runs, no type exists (yet). The rule simply cannot apply to something that does not exist. For a more detailed response, go here.
What you should take away from this is that the best practice is for each concrete object to have a constructor with a signature that best represents how a consumer should fully instantiate that particular object. In some cases where inheritance is involved, “borrowing” the parent’s constructor is acceptable and useful. Furthermore, when you subclass a particular type, your new type should, when appropriate, have its own constructor that makes the most sense for the new subtype.
At this point, it should be noted that most other languages do not allow constructors to be marked final, be abstract, or be marked as static (see above on the static note). Moreover, constructors should not appear in interfaces. In PHP, these rules do not apply, and all of these are possible. For the reasons listed above, a developer should avoid the practice of marking constructors final, making them abstract, and putting them in interfaces, assuming they are trying to utilize PHP’s OO model in a SOLID way. In PHP 5.4, it is also worth knowing that having constructors in interfaces breaks the common expectation that subtypes are capable of creating their own constructors, in favor of enforcing a particular method signature.
Constructor Overloading
PHP does not have method overloading. This also applies to constructors. A class of a specific type can only have one constructor. Since this is the case, PHP developers sometimes loosen a method’s signature in order to accommodate multiple use cases. This is done by removing or reducing the types enforced in the constructor’s signature to allow for more varied types to be passed in by the consumer.
This is an acceptable practice when done appropriately. What is “appropriate” is, of course, very much subjective. Generally speaking, the differences in the various signatures supported should be minimal, yet meaning should still be communicated through the name of the parameters. For example, let’s take this constructor:
class Db {
    protected $driver;

    /**
     * @param string|array|DriverInterface $driver
     */
    public function __construct($driver) {
        // createDriverFromString() and createDriverFromArray() are
        // assumed to be defined elsewhere in this class
        if (is_string($driver)) {
            $driver = $this->createDriverFromString($driver);
        } elseif (is_array($driver)) {
            $driver = $this->createDriverFromArray($driver);
        }
        if (!$driver instanceof DriverInterface) {
            throw new Exception();
        }
        $this->driver = $driver;
    }
}
The above signature __construct($driver) technically supports 3 effective signatures:
__construct(/* string */ $driver); __construct(/* array */ $driver); __construct(DriverInterface $driver);
The actual signature has not changed, but it represents all 3 effective signatures, which can be further described by the PHP DocBlock.
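To see the three effective signatures side by side, here is a minimal self-contained sketch. SimpleDriver, getDriver() and the convention that a string is treated as a driver name are all assumptions made up for this illustration; a real Db class would do whatever its own factory methods dictate.

```php
<?php
interface DriverInterface {}

// Hypothetical driver used only for this illustration.
class SimpleDriver implements DriverInterface
{
    public $options;

    public function __construct(array $options)
    {
        $this->options = $options;
    }
}

class Db
{
    protected $driver;

    /**
     * @param string|array|DriverInterface $driver
     */
    public function __construct($driver)
    {
        if (is_string($driver)) {
            // assumed convention: the string is a driver name
            $driver = new SimpleDriver(array('name' => $driver));
        } elseif (is_array($driver)) {
            $driver = new SimpleDriver($driver);
        }
        if (!$driver instanceof DriverInterface) {
            throw new InvalidArgumentException('Unsupported $driver argument');
        }
        $this->driver = $driver;
    }

    public function getDriver()
    {
        return $this->driver;
    }
}

// One declared signature, three ways to call it:
$fromString = new Db('mysql');
$fromArray  = new Db(array('name' => 'pgsql', 'host' => 'localhost'));
$fromObject = new Db(new SimpleDriver(array('name' => 'sqlite')));
```

Anything that cannot be turned into a DriverInterface, such as an integer, is rejected by the final instanceof check.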
Constructor Injection
At this point in the PHP community and in PHP-centric developer circles, it is generally accepted that injecting your dependencies is a best-practice. How developers go about injecting these dependencies is still very much debated and, in-part, up to personal and/or team preference.
There are several such methods of dependency injection: interface injection, setter injection and constructor injection to name the primary forms. For the purposes of this post, constructor injection is our primary candidate for discussion.
In short, constructor injection is a pattern of injecting all of your required dependencies into a constructor. These dependencies are usually other objects, often called services. The primary benefit of constructor injection is that after you instantiate the target object, generally, it is in the complete “ready state,” meaning that it is ready to do real work. A typical constructor signature sporting constructor injection looks like this:
class UserMapper {
}

class UserRepository {
    public function __construct(UserMapper $userMapper) {
        $this->userMapper = $userMapper;
    }
}
The above example clearly demonstrates that before a developer can use a UserRepository object, they must first inject it with a UserMapper object.
In PHP, while in recent times we’ve started favoring dependency injection (which can add some complexity to code), we have traditionally gravitated towards code that is easy to write and easy to use. Practicing good dependency injection can be tedious at times and, in many cases, dependencies for objects can be stubbed by a sensible default. This practice is also known by the name of Poka-Yoke. It allows us to develop an API that supports explicit injection of dependencies while promoting ease of use in common or majority use cases. Consider the following code:
interface UserMapperInterface {
}

class UserMapper implements UserMapperInterface {
}

class UserRepository {
    protected $userMapper;

    public function __construct(UserMapperInterface $userMapper = null) {
        $this->userMapper = ($userMapper) ?: new UserMapper;
    }
}
While the UserRepository allows you to inject your dependency of the UserMapper, it will, if one was not provided, instantiate a sensible default UserMapper for you. The benefits are that in the most common use cases, it is a one step usage scenario (just instantiate the UserRepository). But in unit testing scenarios or scenarios where you want to inject an alternate implementation of a UserMapper, that can be achieved through the constructor.
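Here is a small runnable sketch of both usage scenarios. The findNameById() and getName() methods and the StubUserMapper class are hypothetical additions so that there is something observable to call, and the nullable type hint (?UserMapperInterface) assumes PHP 7.1 or later.

```php
<?php
interface UserMapperInterface
{
    public function findNameById($id);
}

class UserMapper implements UserMapperInterface
{
    public function findNameById($id)
    {
        // stands in for a real database lookup
        return 'user-from-database-' . $id;
    }
}

// Alternate implementation, e.g. for a unit test.
class StubUserMapper implements UserMapperInterface
{
    public function findNameById($id)
    {
        return 'stub-user';
    }
}

class UserRepository
{
    protected $userMapper;

    public function __construct(?UserMapperInterface $userMapper = null)
    {
        // fall back to the sensible default when nothing is injected
        $this->userMapper = $userMapper ?: new UserMapper();
    }

    public function getName($id)
    {
        return $this->userMapper->findNameById($id);
    }
}

$default = new UserRepository();                     // one-step usage
$testing = new UserRepository(new StubUserMapper()); // explicit injection

echo $default->getName(7), "\n"; // user-from-database-7
echo $testing->getName(7), "\n"; // stub-user
```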
Dynamic Class Instantiation
Generally speaking, the following code, while legal, should be used very seldom, and only when other possible instantiation patterns have been exhausted:
$obj = new $className();
if (!$obj instanceof SomeBaseType) {
    throw new InvalidTypeException();
}
Why is this a bad pattern? First, it makes the assumption up front that the constructor signature is free from any required parameters. While this is good for object types that are already known to this factory, it might not always be true of a consumer’s subtype of the base object in question. This pattern should never be used on objects that have dependencies, or in situations where it is conceivable that a subtype might have dependencies, because this takes away the possibility for a subtype to practice constructor injection.
Another problem is that instead of managing an object, or a list of objects, you are now managing a class name, or list of class names in addition to an object or list of objects. Instead, one could simply manage the objects.
If, on the other hand, you know this particular object type is no more than a value object (or similar), with no chance of it needing dependencies in subtypes, you can then cautiously use this instantiation pattern.
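If you do go down this road, one way to be cautious is to check up front that the class really can be built without arguments. The sketch below uses PHP’s Reflection API for that; the canInstantiateWithoutArguments() helper and the Money and Mailer classes are hypothetical names used only for illustration.

```php
<?php
/**
 * Hypothetical guard: only allow `new $className()` when the class is
 * instantiable and its constructor has no required parameters.
 */
function canInstantiateWithoutArguments(string $className): bool
{
    $reflection = new ReflectionClass($className);
    if (!$reflection->isInstantiable()) {
        return false;
    }
    $constructor = $reflection->getConstructor();

    return $constructor === null
        || $constructor->getNumberOfRequiredParameters() === 0;
}

// A value object: safe to create dynamically.
class Money
{
    public function __construct($amount = 0) {}
}

// A service with a dependency: not safe to create dynamically.
class Mailer
{
    public function __construct(ArrayObject $transport) {}
}

var_dump(canInstantiateWithoutArguments('Money'));  // bool(true)
var_dump(canInstantiateWithoutArguments('Mailer')); // bool(false)
```

This does not remove the underlying design problem; it only turns a confusing failure deep inside a factory into an explicit check.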
Prototype Pattern
So how does one create an unlimited number of objects of a particular type, with dependencies intact, each with slight variations? Enter the prototype pattern. This is an important pattern to keep handy when you know that you’ll have objects that need to be replicated in some way and they also have service dependencies that need to be injected.
To draw a parallel, this is similar to how Javascript handles its object model. To sum prototyping up in Javascript: functions and properties are defined once per prototype rather than once per object. The new keyword instructs the engine/runtime to create a copy of the prototype and assign to a variable for further specification and interaction.
This is similar to what the Prototype Pattern does in an object-oriented inheritance model. Up front, you create a prototypical instance. This instance will have all its dependencies injected, and any shared configuration and/or values setup. Then, instead of calling new again, a factory (or the consumer) will call clone on the object (a shallow clone will be made), and a new object will be created from the original prototypical object. This newly cloned object can then be further specified, injected with the variations that make this new object unique, thus interacted with as a unique object.
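The shallow-clone behavior is what makes this work, so it is worth seeing in isolation. In this minimal sketch (Connection and Row are made-up names), the clone is a distinct object, yet it shares the injected dependency with the prototype, while array data set on the clone does not leak back:

```php
<?php
class Connection {}

class Row
{
    public $connection;
    public $data = array();

    public function __construct(Connection $connection)
    {
        $this->connection = $connection;
    }
}

$prototype = new Row(new Connection());

$copy = clone $prototype;       // shallow clone
$copy->data = array('id' => 1); // specialize the copy only

var_dump($copy !== $prototype);                         // bool(true): a new object
var_dump($copy->connection === $prototype->connection); // bool(true): shared dependency
var_dump($prototype->data === array());                 // bool(true): prototype untouched
```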
Let’s consider the following example involving a database connection and the Row Gateway pattern. We want to iterate a dataset from a database and during iteration, present each row as a RowGateway object. One way of handling this would be to get the array of data from the database, then during iteration, create a new RowGateway from scratch, injecting the database connection:
class DbAdapter {
    public function fetchAllFromTable($table) {
        // query the database here and build $arrayOfData
        return $arrayOfData;
    }
}

class RowGateway {
    public function __construct(DbAdapter $dbAdapter, $tableName, $data) {
        $this->dbAdapter = $dbAdapter;
        $this->tableName = $tableName;
        $this->data = $data;
    }

    /**
     * These methods require access to the database adapter
     * to fulfill their duties
     */
    public function save() {}
    public function delete() {}
    public function refresh() {}
}

class UserRepository {
    public function __construct(DbAdapter $dbAdapter) {
        $this->dbAdapter = $dbAdapter;
    }

    public function getUsers() {
        $rows = array();
        foreach ($this->dbAdapter->fetchAllFromTable('user') as $rowData) {
            $rows[] = new RowGateway($this->dbAdapter, 'user', $rowData);
        }
        return $rows;
    }
}
A UserRepository will be constructed with a database adapter object. It will then query the database, returning an array of all the rows that satisfied that query. With each row of data, it will create a fresh RowGateway from scratch, injecting all the dependencies, configuration and the row data.
At first glance, you might ask “what if I have a specialized version of RowGateway I want to use?” That can be easily handled by, instead of hard-coding the RowGateway class, using the Dynamic Class Instantiation pattern described above:
class UserRepository {
    public function __construct(DbAdapter $dbAdapter, $rowClass = 'RowGateway') {
        $this->dbAdapter = $dbAdapter;
        $this->rowClass = $rowClass;
    }

    public function getUsers() {
        $rows = array();
        foreach ($this->dbAdapter->fetchAllFromTable('user') as $rowData) {
            $rowClass = $this->rowClass;
            $row = new $rowClass($this->dbAdapter, 'user', $rowData);
            if (!$row instanceof RowGateway) {
                throw new InvalidClassType();
            }
            $rows[] = $row;
        }
        return $rows;
    }
}
This partially solves the problem in that we can now use our own specialized class for the RowGateway implementation, but this too has its own special set of limitations. First, we are incorrectly making the assumption that the constructor signature of a subtype of RowGateway is exactly the same as the base type. This means that if a subtype has additional dependencies, that class will need to do the static dance in order to locate and consume those dependencies that it needs to achieve its specialized functionality. By making this assumption about the class’s constructor signature, we’re limiting the consumer’s ability to practice polymorphism in the subtypes that they might need to have created.
For example, if a consumer wanted to be able to have a RowGateway object that wrote data to one specific database, but refreshed its data from a different database, how might one be able to inject two different DbAdapters into a RowGateway object to achieve this end result?
The answer is to use the Prototype Pattern, and in practice (via pseudo-code), looks like this:
class DbAdapter {
    // same as before
}

class RowGateway {
    public function __construct(DbAdapter $dbAdapter, $tableName) {
        $this->dbAdapter = $dbAdapter;
        $this->tableName = $tableName;
    }

    public function initialize($data) {
        $this->data = $data;
    }

    /**
     * These methods require access to the database adapter
     * to fulfill their duties
     */
    public function save() {}
    public function delete() {}
    public function refresh() {}
}

class UserRepository {
    public function __construct(DbAdapter $dbAdapter, RowGateway $rowGatewayPrototype = null) {
        $this->dbAdapter = $dbAdapter;
        // use the injected prototype, or fall back to a default one
        $this->rowGatewayPrototype = ($rowGatewayPrototype) ?: new RowGateway($this->dbAdapter, 'user');
    }

    public function getUsers() {
        $rows = array();
        foreach ($this->dbAdapter->fetchAllFromTable('user') as $rowData) {
            $rows[] = $row = clone $this->rowGatewayPrototype;
            $row->initialize($rowData);
        }
        return $rows;
    }
}
By using a prototypical instance as the base for all future instances, we now allow the consumer the ability to extend this base implementation using sound object-oriented/polymorphic best-practices to achieve their end result. So, assuming our above example of the read/write adapter, a consumer can write:
class ReadWriteRowGateway extends RowGateway {
    public function __construct(DbAdapter $readDbAdapter, DbAdapter $writeDbAdapter, $tableName) {
        $this->readDbAdapter = $readDbAdapter;
        parent::__construct($writeDbAdapter, $tableName);
    }

    public function refresh() {
        // utilize $this->readDbAdapter instead of $this->dbAdapter in RowGateway base implementation
    }
}

// usage:
$userRepository = new UserRepository(
    $dbAdapter,
    new ReadWriteRowGateway($readDbAdapter, $writeDbAdapter, 'user')
);
$users = $userRepository->getUsers();
$user = $users[0]; // instance of ReadWriteRowGateway with a specific row of data from the db
Parting Words
Be nice to people who want to consume and extend your code. A constructor is more than just a place for initialization code. How you craft your constructors, the patterns you use for their signatures, and how you expect to get new instances of objects greatly affect the ability of consumers to extend your code without having to jump through too many hoops in order for them to achieve their specialized use case. It is always better to fall back on SOLID object-oriented practices than to limit someone’s possibilities by forcing them into coding patterns that require reading in-depth documentation on how the original author expects someone to extend their code.
Fellow Nestordammers,
I'm delighted to announce that for the third year in a row we will be sponsoring WhereCamp EU! After London in 2010 and Berlin in 2011, this year's event will be in Amsterdam on April 28th and 29th.
As always "camp" events are fairly free form, so it's hard to know exactly what to expect. But if past years are any guide there will be lively discussion, some interesting demos, and (just perhaps) a geobeer or three along the way. The pace of innovation in online cartography continues to accelerate, so there is much to discuss. Several members of the Nestoria team will be in attendance. We look forward to seeing you there.
Many thanks to the organisers and other sponsors for creating what is sure to be a great weekend. The best way to stay up to date on WhereCamp EU is of course via the twitter feed.
On a final note, if you're interested in all things web and geo but unfortunately can't make it to Amsterdam, consider joining us at #geomob events in London.
With the release of Beta 3 of Zend Framework 2, we now have a significantly refactored Zend\View component.
One of the changes made is that there is a ViewModel object that is returned from a controller which contains the variables to be used within the view script along with meta information such as the view script to render. The really nice thing about ViewModels is that they can be nested and this is how the layout composes the action view script.
However, we can do many more interesting things than this and I've put together a test application with a controller showing some of the things that can be done.
Some examples:
Change the layout in an action:
public function differentLayoutAction()
{
    // Use a different layout
    $this->layout('layout/different');

    return new ViewModel();
}
Create another view model at the layout's level:
public function addAnotherViewModelToLayoutAction()
{
    // Use an alternative layout
    $layoutViewModel = $this->layout();
    $layoutViewModel->setTemplate('layout/another');

    // add an additional layout to the root view model (layout)
    $sidebar = new ViewModel();
    $sidebar->setTemplate('layout/footer_one');
    $layoutViewModel->addChild($sidebar, 'footer');

    return new ViewModel();
}
I've created some other examples too, so I recommend that you grab the code from GitHub and have a play! The code also includes a second module, Simple, which shows how to change the layout for an entire module.
You should also read the manual!
In recent weeks, I consulted with the second most intelligent species on the planet: Dolphins. Dolphins are renowned across the known Universe for their awesome programming skills. After all, it was they who developed such insightful works as “Evolution By Example”, “Dude! We Wrote The Laws Of Physics!”, and “How Many Humans Does It Take To Screw Up A Planet?”. The answer to the last will be published on 01/01/2013 after the experiment is shut down and sent to a landfill site assuming the Supreme Spaghetti Monster signs off on the permit.
Dolphins think we are really dumb and theorise that this level of stupidity has one obvious cause: self-imposed ignorance. We are, after all, only the third most intelligent species on Earth and appear to have aspirations to lower our IQ just a bit more.
While it’s no harm poking fun at ourselves, in PHP we do have a serious problem. Cross-Site Scripting (XSS) remains one of the most significant classes of security problems afflicting PHP applications. Despite years of education, community awareness and the development of frameworks which can offer a huge boost in consistent practices – things are not getting any better.
So, I finally figured out what the core problem is: PHP programmers are completely clueless about XSS. It’s that simple. Instead of going out and studying the topic, we blindly follow some preferred herd of people offering advice with heartfelt conviction despite the fact that they are probably just as ignorant as the rest of us. Does that sound like the behaviour of something which allegedly evolved into an intelligent species? The result is a mix of ignorance and stagnant knowledge that leaves PHP in an unenviable position beset by wrongheaded zealots.
To get the ball rolling, this two-part article series is a tour of how NOT to use the htmlspecialchars() function that is typically press-ganged into service as PHP’s universal output escaper. By offering an example-based guide, I hope it will illustrate just how many ways a prospective attacker using XSS can exploit this function’s misuse to pull off a successful attack. The examples were written for PHP 5.3, so 5.4 users may need to imagine they still have 5.3 installed and/or lodge an official complaint with somebody who looks like they keep a complaints box handy (your local fast food restaurant is a good start).
This example-led approach has another motive. Simple examples can be translated into unit tests. Ideally, many of the current crop of frameworks can use this article as a guide to what their unit tests should be looking for. This also makes it far easier for everyday programmers to consume the article and run around the place, drunk with ungodly power, identifying issues in the libraries, frameworks and other projects that they rely on.
To help us on the path of enlightenment before it’s too late (I’d lodge an appeal with the Supreme Spaghetti Monster but apparently the Mayans already tried and failed), I also invite other PHP programmers to blog about a security topic over the next month or two. Give programmers one last chance to get it right before the Planet is demolished by the Vogon destructor fleet. Just pick a topic that drives you up the walls in defiance of gravity and spend an hour writing something useful and (optionally) expletive filled. Every little bit helps.
What Is Htmlspecialchars()?
According to many programmers from Earth, htmlspecialchars() is a function used to escape output to prevent XSS. This is however a completely wrong definition. The function was actually co-opted by programmers to combat XSS because it was either that or create slow userland functions for which the internals developers might, when the full moon coincided with the right planetary alignment in another 314 years, get around to creating a speedier C alternative. The actual definition (along with a half-hearted self-doubting nod to preventing XSS) is as follows:
Certain characters have special significance in HTML, and should be represented by HTML entities if they are to preserve their meanings. This function returns a string with some of these conversions made; the translations made are those most useful for everyday web programming. If you require all HTML character entities to be translated, use htmlentities() instead. This function is useful in preventing user-supplied text from containing HTML markup, such as in a message board or guest book application.
Note that this hints at, but does not explicitly use, the terms Cross-Site Scripting, XSS or even Security. Then again, it does refer to guest book applications so it was probably written in 1790 by the Dolphin who created PHP v86 and who then got around to backporting version 1.0 for Humans in the late 20th Century out of extreme pity for our reliance on CGI. No, not the let’s take an action movie and turn it into a plotless eyesore with computer generated fake stuff style CGI – though memories of both are comparably bad.
Does this make htmlspecialchars() terrible at preventing XSS? No. As part of a comprehensive well-understood strategy to prevent XSS, the function is very useful. However, in PHP it is frequently overused, misused, abused, confused and…. Darn it, ran out of rhyming words again. Suffice it to say that a good description of htmlspecialchars() is that it’s an unsuitable tool for preventing XSS that has slowly evolved into a better suited tool over the years. I keep telling myself that, at least.
The function, htmlspecialchars(), accepts four parameters. Here is its function prototype as of PHP 5.4:
string htmlspecialchars ( string $string [, int $flags = ENT_COMPAT | ENT_HTML401 [, string $encoding = 'UTF-8' [, bool $double_encode = true ]]] )
The first parameter accepts a string whose special HTML characters will be converted to HTML entities. The second accepts one or more flags and defaults to ENT_COMPAT (does not convert single quotes to entities) but should be set to ENT_QUOTES (does convert single quotes to entities). You can include another flag, in PHP 5.4, called ENT_SUBSTITUTE which is not a bad idea for UTF-8, i.e. ENT_QUOTES | ENT_SUBSTITUTE. You can pretend that all the other constants don’t exist. The third parameter accepts a string indicating the character encoding of the string being processed and defaults to ISO-8859-1 for PHP 5.3, and UTF-8 for PHP 5.4. Don’t ever set the fourth parameter to FALSE when escaping (doing so leaves existing entities in the input untouched) unless your filtering logic was written by an Über Dolphin. Always keep filtering and escaping separate from each other to avoid confusing the two and then having to pointlessly argue why your way is better in defiance of all logic.
The function, if correctly configured using this super simple article for guidance, will now convert the following characters to entities: <, >, ', " and &. These characters make sense to escape since they are used to construct HTML tags, delineate attribute values or reference HTML entities, none of which we want users to be able to do!
If you want some very good advice before your brain implodes from too much reading, a good way to potentially make yourself vulnerable to XSS is to not explicitly set the first two optional parameters ($flags and $encoding) to an appropriate value. In fact, if you see htmlspecialchars() missing either of those two parameters in someone’s source code, you should request that they fix it or, at the very least, curse their name and pray for the Supreme Spaghetti Monster to label them as biohazardous waste in need of emergency disposal.
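One way to avoid ever forgetting those two parameters is to hide the call behind a tiny wrapper. The sketch below is only an illustration: the name escapeHtml() is my own invention, and ENT_SUBSTITUTE assumes PHP 5.4 or later.

```php
<?php
/**
 * Hypothetical wrapper (the name escapeHtml is an assumption) that always
 * passes explicit flags and an explicit encoding to htmlspecialchars().
 */
function escapeHtml(string $value): string
{
    return htmlspecialchars($value, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

echo escapeHtml('<a href="x">Tom & Jerry\'s</a>'), "\n";
// &lt;a href=&quot;x&quot;&gt;Tom &amp; Jerry&#039;s&lt;/a&gt;
```

All five characters are converted, including the single quote that ENT_COMPAT would have let through.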
Now, let’s get down to overloading your brain with information. I’m told that this part is like being sucked into the Total Perspective Vortex machine on Frogstar World B.
To Quote Or Not To Quote. How Is That A Question?
As it turns out, HTML is not simply a popular markup language, it is a popular markup language designed by a bureaucratic species of transdimensional beings seeking to drive Humanity insane by inventing the most impossible-to-secure markup language known in 172 Universes which is then interpreted by “browsers” written by Mice to test the patience of security professionals and keep the really intelligent Humans distracted from the truth of their soon-to-end existence as they search out ever more ludicrous examples of parsing weirdness. Excuse me, I held my breath writing that and need to fetch my Oxygen tank…
Consider the following example. If you want to see whether the examples work without copy-pasting, you can clone all of them from my ominously titled xss repository on GitHub into a webroot somewhere to read or execute them.
<?php header('Content-Type: text/html; charset=UTF-8'); ?>
<!DOCTYPE html>
<?php
$input = <<<INPUT
' onmouseover='alert(/Meow!/);
INPUT;

/**
 * NOTE: This is equivalent to using htmlspecialchars($input, ENT_COMPAT)
 */
$output = htmlspecialchars($input);
?>
<html>
<head>
    <title>Single Quoted Attribute</title>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
    <div>
        <span title='<?php echo $output ?>'>
            What's that latin placeholder text again?
        </span>
    </div>
</body>
</html>
If you run the example from a browser and pass your mouse pointer over the text, you will get a popup saying “/Meow!/”. Granted, this is hardly the most impressive XSS ever but remember that the Javascript executed could be a lot more ingenious and damaging. The reason you see alert() used everywhere in XSS examples is to prove that Javascript was executable – a real attacker will hardly advertise his success like this.
In this case, the htmlspecialchars() function call omits the second parameter which defaults to using the ENT_COMPAT flag. With this setting, the function does not convert single quotes to entities, allowing us to inject an unescaped single quote (to close the title attribute value) and another to start a new attribute and value which will be closed by the final single quote used in the template.
We can fix this problem in one of two ways:
1. Use double quotes which will prevent user input from breaking out of the HTML attribute value context using single quotes; or
2. Set the second parameter to htmlspecialchars() to use the ENT_QUOTES flag which will escape any single quotes a user tries to inject.
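The difference between the two flags is easy to check outside a browser. A small sketch using the same payload as the single-quoted example (the variable names are mine):

```php
<?php
// The payload from the single-quoted attribute example.
$input = "' onmouseover='alert(/Meow!/);";

// ENT_COMPAT leaves single quotes alone, so the payload survives intact.
$compat = htmlspecialchars($input, ENT_COMPAT, 'UTF-8');

// ENT_QUOTES turns each single quote into the &#039; entity.
$quotes = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

var_dump($compat === $input);
echo $quotes, PHP_EOL;
```

With ENT_COMPAT the string comes back byte for byte identical; with ENT_QUOTES no raw single quote is left to close the attribute.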
The moral of the story can be made even clearer by another example. In this case we use another perfectly validating means of delineating attribute values in HTML5 – we just don’t bother using quotes at all!
<?php header('Content-Type: text/html; charset=UTF-8'); ?>
<!DOCTYPE html>
<?php
$input = <<<INPUT
faketitle onmouseover=alert(/Meow!/);
INPUT;

$output = htmlspecialchars($input, ENT_QUOTES);
?>
<html>
<head>
    <title>Quoteless Attribute</title>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
    <div>
        <span title=<?php echo $output ?>>
            What's that latin placeholder text again?
        </span>
    </div>
</body>
</html>
Without quotes delineating the attribute value, any space character (including any character a browser might interpret as a space – there are a lot!) allows the user to inject new attributes and values. As the example above shows, converting all quotes to entities is pointless if there are no quotes to start with! Our escaping doesn’t convert spaces or other space-interpreted characters into entities at all.
By now, you should see the obvious. All HTML attribute values MUST be quoted, and preferably DOUBLE quoted, in any scenario where you suspect untrusted input will be injected into an attribute value, or where htmlspecialchars() calls do not set the second parameter to use ENT_QUOTES. Believe it or not, using single quotes or no quotes remains popular and is perfectly valid under the new HTML5 spec. Some people even celebrate this new insanity. Keep an eye on any designers who look a bit wild eyed or spend too much time smiling while staring into empty space.
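To see why the quoting has to happen in the template, here is a short sketch with the quoteless payload. Escaping leaves it completely untouched, so the double quotes around the value are the only thing keeping it contained (variable names are illustrative):

```php
<?php
// The payload from the quoteless attribute example.
$input  = 'faketitle onmouseover=alert(/Meow!/);';
$output = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

// Nothing in the payload is a special character, so escaping changes nothing.
var_dump($output === $input);

// Unquoted: the space ends the title value and onmouseover becomes an attribute.
$unsafe = '<span title=' . $output . '>';

// Double quoted: the whole payload stays inside the title value.
$safe = '<span title="' . $output . '">';

echo $unsafe, PHP_EOL, $safe, PHP_EOL;
```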
Excuse Me, Sir, But Someone Ate My Quotes
One of the great mysteries in escaping output is a common myth known as the Great ASCII Delusion (GAD). Those under the influence of this delusion, besides hearing voices in their head, have arrived at a belief that many character encodings are equivalent for the purposes of escaping those characters which have a special meaning for HTML, e.g. ISO-8859-1 and UTF-8. Alas, this is untrue because the Mice created something called Internet Explorer 6 – a thoroughly shameful (but still commonly used) browser which corporations across the Planet continue to insist on using because buying new computers and upgrading operating systems just to use some fancy new Microsoft Office version is seen as a waste of shareholder funds.
Internet Explorer 6 is the bad boy of the XSS world since it’s vulnerable to ridiculous exploits no decent modern browser would dare associate with. Even Netscape would probably spit on it from beyond the grave. For example, have a go with this example using IE6 and PHP 5.3. If you need a testing version of all IE browsers since IE 5.5, you can download IETester from http://www.my-debugbar.com/ietester/index_all.php and use it from Windows. Try hard, I know Windows is bad and the new Tablet makeover for Windows 8 makes you feel ill, but it’s important to see these examples in action.
<?php header('Content-Type: text/html; charset=UTF-8'); ?>
<!DOCTYPE html>
<?php
/**
 * You could also substitute "\xC0" or any other impacted character
 * above ASCII number 192
 */
$input1 = 'fakeimage'.chr(192);

$input2 = <<<INPUT2
onerror=alert(/Meow!/)//
INPUT2;

$output1 = htmlspecialchars($input1, ENT_QUOTES);
$output2 = htmlspecialchars($input2, ENT_QUOTES);
?>
<html>
<head>
    <title>Swallowed Quotes</title>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
    <div>
        <img src="http://example.com/images/<?php echo $output1 ?>"
             title="<?php echo $output2 ?>">
    </div>
</body>
</html>
With the above example, something very weird happens. Using ASCII character number 192 just before a double quote in a document being interpreted as UTF-8 results in the double quote…vanishing in IE6. Seriously, it’s there but not there. Obviously the Mice are behind it – no Human could possibly defy Physics like this!
This allows an attacker to once again break out of the HTML attribute they can inject values into. Using a coincidental opportunity to inject a second free text string nearby which a browser will concatenate to the broken out attribute value of the first, you get an effective XSS combo attack.
This IE6 quirk even bypasses the call to htmlspecialchars() which, as explained above, defaults to the ISO-8859-1 character encoding for PHP 5.3 or earlier. If the Great ASCII Delusion were not a fabrication of someone’s imaginative wishful thinking, this should not be possible. Not to be too harsh though, this weirdness is due primarily to a bug in IE6's treatment of the various character encodings where you can trick the browser into thinking something like \xC0 (in hex) is the start of a multi-byte character, thus swallowing the next ASCII character (the double quote).
To fix the above weirdness, you must make sure that escaping is done using the same character encoding that the document is being served as. The above HTML document is identifying itself as being UTF-8 but the default htmlspecialchars() encoding is ISO-8859-1 in PHP 5.3 – there’s obviously something not agreeing there! This brings us to the absolutely perfect use (well, almost) of htmlspecialchars(), the golden rule, the Word of The Supreme Spaghetti Monster, the bringer of frustration to XSS attackers:
Always set the third parameter to htmlspecialchars(), set it correctly, and make sure your document is never served with a mismatched or invalid character encoding! Don’t expect some theoretically perfect world to magically appear – browsers are filthily efficient at doing weird things you don’t expect.
I suppose I have to mention that most versions of IE have similar issues with other character encodings such as BIG5 and Shift-JIS. You can test your IE versions using http://ha.ckers.org/weird/variable-width-encoding.cgi to see what characters can be used across different character encodings. Believe it or not, these character encodings are actually still being used and, for some strange reason, people from China and Japan do use PHP.
If you want to be completely paranoid, you can either check the input for invalid UTF-8 (Drupal and HTMLPurifier have reusable functions/classes for this), and/or run it through a conversion function which should theoretically filter out the naughty bits:
$input = mb_convert_encoding($input, 'UTF-8', 'UTF-8');
This is probably a good idea for PHP versions released before 2010, but recent PHP versions have specifically improved htmlspecialchars() to disallow invalid characters such as the above (if you set the right character encoding!). You should be aware, though, that htmlspecialchars() may still return blank strings on certain malformed input and, since PHP 5.4, will not issue any warnings about this.
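If you would rather detect malformed input than silently clean it, a validity check is cheap. The sketch below is an assumption of mine rather than the Drupal or HTMLPurifier code: it relies on preg_match() failing when the /u modifier is given a subject that is not valid UTF-8, and the helper name is_valid_utf8() is made up.

```php
<?php
// Hypothetical helper: preg_match() with the /u modifier returns false
// when the subject is not valid UTF-8, which makes it a cheap validity check.
function is_valid_utf8($value)
{
    return preg_match('//u', $value) === 1;
}

$bad = 'fakeimage' . "\xC0"; // a lone 0xC0 byte is never valid UTF-8

var_dump(is_valid_utf8('fakeimage'));
var_dump(is_valid_utf8($bad));

// With the correct encoding set (and no ENT_SUBSTITUTE or ENT_IGNORE),
// htmlspecialchars() rejects the malformed string by returning ''.
$escaped = htmlspecialchars($bad, ENT_QUOTES, 'UTF-8');
var_dump($escaped === '');
```

The last check shows the blank-string behaviour mentioned above: the bad byte never reaches the browser, but neither does the rest of the value.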
I Broke It! I Broke It!
Before you think htmlspecialchars() is getting off lightly, there is one minor quibble. We’ll keep picking on Internet Explorer 6 for the rest of this article since it’s so easy to exploit.
<?php header('Content-Type: text/html; charset=UTF-8'); ?>
<!DOCTYPE html>
<?php
$input1 = 'fakeimage'."\xC0";

$input2 = <<<INPUT2
onerror=alert(/Meow!/)//
INPUT2;

/**
 * If you think PHP 5.4 will save you - empty strings make it guess the encoding
 * or use the default_charset value from php.ini. You sure everyone on the whole
 * planet uses UTF-8? Under 5.3 - empty strings === default encoding.
 */
$encoding = ''; // from outside source or unvalidated variable

$output1 = htmlspecialchars($input1, ENT_QUOTES, $encoding);
$output2 = htmlspecialchars($input2, ENT_QUOTES, $encoding);
?>
<html>
<head>
    <title>Swallowed Quotes</title>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
    <div>
        <img src="http://example.com/images/<?php echo $output1 ?>"
             title="<?php echo $output2 ?>">
    </div>
</body>
</html>
Setting the third $encoding parameter of htmlspecialchars() to an empty string in PHP 5.4 will set the encoding to be auto-detected, grabbed from the php.ini value of default_charset, or guessed from the current locale (in that order). Be very careful under PHP 5.4 NEVER to let this happen. Don’t leave your escaping parameters to chance.
Use empty() or strlen(), for example, to spot this issue if accepting encodings from another source or variable that might allow for empty strings. Again, this behaviour is very secure and there’s nothing wrong with it whatsoever. Oh, who am I kidding… This is the dumbest parameter behaviour ever invented. NULL means use the default encoding; blank string means play a guessing game. Even Vogon poetry pales in comparison to such nonsense. One slip and an empty parameter string can rip apart this house of cards because who knows which character encoding will be used.
Oooh, I wonder what this does under PHP 5.3… Yes, er, don’t allow blank encoding parameter strings under PHP 5.3 either. Setting an empty string in PHP 5.3 is interpreted as setting the default character encoding, i.e. ISO-8859-1, instead of triggering the expected warning about an unsupported encoding.
So, be careful kids. When setting the encoding for htmlspecialchars() do a safety check to make sure it’s not an empty string you are passing in. Keep it predictable and consistent.
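A safety check along those lines might look like the following sketch. The function name checked_encoding() and the two-entry whitelist are assumptions for illustration; the point is simply that an empty, NULL or unknown encoding is refused outright instead of being handed to htmlspecialchars().

```php
<?php
// Hypothetical guard: accept only encodings the application knowingly serves.
function checked_encoding($encoding)
{
    // Assumed whitelist; adjust to the charsets your documents really use.
    $allowed = array('UTF-8', 'ISO-8859-1');

    if (!is_string($encoding) || !in_array(strtoupper($encoding), $allowed, true)) {
        // Covers empty strings, NULL and unsupported names alike.
        throw new InvalidArgumentException('Refusing to escape with an unknown encoding');
    }

    return strtoupper($encoding);
}

$encoding = checked_encoding('utf-8');
echo htmlspecialchars('"quoted"', ENT_QUOTES, $encoding), PHP_EOL;

try {
    checked_encoding(''); // the dangerous empty-string case
} catch (InvalidArgumentException $e) {
    echo 'rejected: ', $e->getMessage(), PHP_EOL;
}
```

Throwing rather than falling back to a default keeps the failure loud, which seems preferable to escaping with an encoding nobody chose.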
There’s also one other curious behaviour when using htmlspecialchars().
<?php header('Content-Type: text/html; charset=UTF-8'); ?>
<!DOCTYPE html>
<?php
error_reporting(E_ALL);
ini_set('display_errors', 1);

$input1 = 'fakeimage'."\xC0";

$input2 = <<<INPUT2
onerror=alert(/Meow!/)//
INPUT2;

/**
 * Invalid encoding makes htmlspecialchars() throw a warning but it continues
 * the current operation anyway using the default encoding even if the default
 * is an unsafe choice for the application. Don't allow invalid encodings!
 */
$encoding = 'invalid-encoding'; // from outside source or unvalidated variable

$output1 = htmlspecialchars($input1, ENT_QUOTES, $encoding);
$output2 = htmlspecialchars($input2, ENT_QUOTES, $encoding);
?>
Pádraic Brady has posted A Hitchhiker’s Guide to Cross-Site Scripting (XSS) in PHP (Part 1): How Not To Use Htmlspecialchars() For Output Escaping:
Always set the third parameter to htmlspecialchars(), set it correctly, and make sure your document is never served with a mismatched or invalid character encoding! Don’t expect some theoretically perfect world to magically appear - browsers are filthily efficient at doing weird things you don’t expect.
With a nod to the anniversary of Douglas Adams' death on Sunday, Pádraic Brady has written possibly the definitive guide to the htmlspecialchars() function.
Read it. Then read it again.
The long lasting MySQL replication failover issue is cured. MySQL 5.6 makes master failover easy, PECL/mysqlnd_ms assists with the client/connection failover. Compared to the past this is a significant step towards improving MySQL replication cluster availability, eliminating the need to use 3rd party tools in many cases. The slides illustrate the basic idea, as blogged about before.
There is not much to say about the feature as such. Slave to master promotion works without hassles, finally. Regardless of whether you fail over because the current master has an error or switch over because you want to change the master server, it's easy now. Congratulations to the replication team!
Limitations of the current server implementation
The global transaction identifier implementation in MySQL 5.6 has a couple of limitations, though. It's not hard to guess that mixing transactional and non-transactional updates in one transaction can cause problems. It's pretty much the first pitfall I ran into when trying to set up a MySQL 5.6.5-m8 (not 5.6.4-m8…) slave using a mysqldump generated SQL dump. MySQL bailed out and stopped me from failing.
Let a master run a transaction which first updates an InnoDB table, followed by an update of a MyISAM table, followed by another update: t(UInnoDB, UMyISAM, UX). Let the binary log settings be so that this transaction is written as one to the binary log (binlog_format=statement, binlog_direct_non_transactional_updates=0). It is then copied “as is” to the relay log of a slave. Assume that the slave runs with different binary log settings so that t(UInnoDB, UMyISAM, UX) is split up into t(UMyISAM), t(UInnoDB[, …]) and logged as distinct transactions in the slave's binary log.
| Worst case: conflicting binary log settings | | | |
|---|---|---|---|
| Master | | Slave | |
| GTID=M:1 | t(UInnoDB, UMyISAM, UX) | GTID=M:1 | t(UMyISAM) |
| | | GTID=M:1 | t(UInnoDB[, …]) |
Because slaves must preserve global transaction identifiers they got from their master, the two resulting transactions are given the same identifier. The transaction identifier in the slave's binary log is no longer unique: it now refers to two transactions, not just one (issue #1). Any slave that reads from the binary log of the above slave may lose the InnoDB transaction because it may refuse to execute a transaction using an id that has been executed already (issue #2).
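Issue #2 can be pictured with a toy model. This is not MySQL code and says nothing about how the server stores executed identifiers; it only assumes, as described above, that a downstream slave skips any transaction whose identifier it has already applied.

```php
<?php
// Toy model, not MySQL code: a downstream slave remembers executed GTIDs
// and skips any transaction whose identifier it has already seen.
$binlog = array(
    array('gtid' => 'M:1', 'trx' => 't(UMyISAM)'),
    array('gtid' => 'M:1', 'trx' => 't(UInnoDB[, ...])'),
);

$executed = array(); // identifiers already applied on this slave
$applied  = array();
$skipped  = array();

foreach ($binlog as $entry) {
    if (isset($executed[$entry['gtid']])) {
        $skipped[] = $entry['trx']; // same id seen before: the transaction is lost
        continue;
    }
    $executed[$entry['gtid']] = true;
    $applied[] = $entry['trx'];
}

print_r($applied);
print_r($skipped);
```

Only the MyISAM part is applied; the InnoDB part shares the identifier M:1 and is dropped.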
The workaround? Don’t mix InnoDB and MyISAM updates in one transaction. To me it does not sound that much of an issue in 2012. Please note, I’m describing my experience with MySQL 5.6.5-m8, which is a development version.
The load balancer update
MySQL Replication takes a primary copy approach to replication. A primary/master handles all the updates. Read-only replicas/slaves replicate from the primary. The primary is a single point of failure.
| Writes | Primary/master | ||
| Reads | Slave | Slave | |
The failure of a slave is unproblematic. A client usually has plenty of other slaves to start reading from. If no slave is available, reads can even be forwarded to the master.
| PHP | |||
| Load Balancer, e.g PECL/mysqlnd_ms | |||
| Read | |||
| Slave | |||
All a PECL/mysqlnd_ms user has to do is check for an error after statement execution. If there’s one and the error code hints that the server has gone away, the user reruns the statement. The connection handle remains usable all the time. Upon rerun, PECL/mysqlnd_ms opens a new connection to another slave.
do {
    $res = $mysql->query("SELECT id, title FROM news");
} while (isset($connection_error_codes[$mysql->errno]));
if (!$res) {
    bail("SQL error", $mysql->errno, $mysql->error);
}
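The loop above keeps retrying for as long as connection errors come back. A bounded variant might look like the sketch below. The function query_with_retry() and its callback contract are my own invention, not part of PECL/mysqlnd_ms; 2006 (server has gone away) and 2013 (lost connection during query) are the usual MySQL client error codes for a dead connection.

```php
<?php
// Hypothetical wrapper: $run must return array($result, $errno).
function query_with_retry($run, array $retryCodes, $maxAttempts = 3)
{
    $result = false;
    $errno  = 0;

    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        list($result, $errno) = call_user_func($run);
        if (!isset($retryCodes[$errno])) {
            break; // success or a non-connection error: stop retrying
        }
    }

    return array($result, $errno);
}

// Connection errors worth a rerun.
$retryCodes = array(2006 => true, 2013 => true);

// Stand-in for a real query: fails twice with 2006, then succeeds.
$calls = 0;
$fake = function () use (&$calls) {
    $calls++;
    return $calls < 3 ? array(false, 2006) : array('rows', 0);
};

list($res, $errno) = query_with_retry($fake, $retryCodes);
var_dump($res, $errno, $calls);
```

With a real connection the callback would run $mysql->query() and return the result together with $mysql->errno.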
A master failure is much more problematic. There is no server to send a write to but the master. The master is a single point of failure. The new global transaction identifiers help to reduce the time it takes to put a new master in place after a failure.
| PHP | |||
| Load Balancer, e.g PECL/mysqlnd_ms | |||
| Write | |||
After a master failure some process needs to promote a former slave to the new master and, preferably atomically, update all other slaves to start replicating from the new master. The illustration below is a bit confusing. That is intentional. What happens during the simple-sounding “slave to master promotion” is a complete restructuring of the cluster.
| Write | ||||
| Read | Slave (no change) | |||
Don’t forget to update the load balancer
After the cluster has been reorganized, the load balancer configurations must be updated. PECL/mysqlnd_ms happens to be a driver integrated load balancer. However, other than that, it is not different from a classical load balancer. Whatever process restructures the cluster must also take care of deploying the load balancer configurations afterwards.
Global transaction identifiers are a great help for the biggest part of the failover job - the server side. But, they are no Swiss Army knife. Don’t forget to update your load balancer configuration - the client side. No matter where it is. Whether it is part of the application code, driver integrated or you are using MySQL Proxy. As long as we are talking primary copy, a master failure will always be a major pain.
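For PECL/mysqlnd_ms the client side update boils down to regenerating its JSON configuration. The sketch below assumes the plugin's documented layout of a named section holding "master" and "slave" server lists; the helper build_ms_config(), the section name and all host names are made up for illustration, and deploying the file to your web servers is left out.

```php
<?php
// Hypothetical helper: rebuild the load balancer configuration after promotion.
function build_ms_config($section, $masterHost, array $slaveHosts)
{
    $slaves = array();
    foreach (array_values($slaveHosts) as $i => $host) {
        $slaves['slave_' . $i] = array('host' => $host);
    }

    return json_encode(array(
        $section => array(
            'master' => array('master_0' => array('host' => $masterHost)),
            'slave'  => $slaves,
        ),
    ));
}

// Assumed topology: db1 was the master, db2 has been promoted, db3 keeps reading.
$json = build_ms_config('myapp', 'db2.example.com', array('db3.example.com'));
echo $json, PHP_EOL;
```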
Happy hacking!
Ralph Schindler has posted PHP Constructor Best Practices And The Prototype Pattern
If your knowledge of constructors ends with “the place where I put my object initialization code,” read on. While this is mostly what a constructor is, the way a developer crafts their class constructor greatly impacts the initial API of a particular class/object; which ultimately affects usability and extensibility. After all, the constructor is the first impression a particular class can make.
In case you missed this last Friday, this is an in-depth look at how to construct an object in PHP whilst adhering to SOLID principles. If you missed this last week, read it now! Get a coffee first.
Both the Module Manager and the MVC system use the Event Manager extensively in order to provide "hook points" for you to add your own code into the application flow. This is a list of the events triggered by each class during a standard request with the Skeleton Application:
Module Manager
- Zend\Module\Manager: loadModules.pre
- For every module:
  - Zend\Module\Manager: loadModule.resolve
  - Zend\Module\Manager: loadModule
- Zend\Module\Manager: loadModules.post
Bootstrap
- Zend\Mvc\Bootstrap: bootstrap
Application
Successful:
- Zend\Mvc\Application: route
- Zend\Mvc\Application: dispatch
- Zend\Mvc\Controller\ActionController: dispatch (if controller extends this class)
- Zend\Mvc\Application: render
- Zend\View\View: renderer
- Zend\View\View: response
- Zend\Mvc\Application: finish
- Zend\Mvc\Application: dispatch.error
- Zend\Mvc\Application: render
- Zend\View\View: renderer
- Zend\View\View: response
- Zend\Mvc\Application: finish
Note that routing and dispatching are also implemented using these registered events, so you can implement "pre" and "post" hooks by changing the priority of the listener that you register.
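To illustrate the priority idea without pulling in the framework, here is a toy event manager. It is not the Zend implementation and its class and method names are only loosely modelled on it; the single assumption carried over is that listeners with a higher priority number run first.

```php
<?php
// Toy event manager, not the Zend implementation: listeners with a higher
// priority number run first.
class TinyEvents
{
    private $listeners = array();

    public function attach($event, $callback, $priority = 1)
    {
        $this->listeners[$event][] = array($priority, $callback);
    }

    public function trigger($event)
    {
        $list = isset($this->listeners[$event]) ? $this->listeners[$event] : array();
        usort($list, function ($a, $b) {
            return $b[0] - $a[0]; // descending priority
        });
        foreach ($list as $listener) {
            call_user_func($listener[1]);
        }
    }
}

$log    = array();
$events = new TinyEvents();

// The "real" dispatch listener sits at the default priority.
$events->attach('dispatch', function () use (&$log) { $log[] = 'dispatch'; }, 1);
// A higher priority turns a listener into a "pre" hook...
$events->attach('dispatch', function () use (&$log) { $log[] = 'pre'; }, 100);
// ...and a lower one into a "post" hook.
$events->attach('dispatch', function () use (&$log) { $log[] = 'post'; }, -100);

$events->trigger('dispatch');
echo implode(' -> ', $log), PHP_EOL;
```

The registration order does not matter here; only the priorities decide that the hooks wrap the dispatch listener.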
Today’s AWS Elastic Beanstalk announcement of PHP and Git support reminded me of the post where I mentioned that we want to let a thousand platforms bloom on AWS. Some might ask why AWS would want a thousand platforms.
One of the most important AWS principles is flexibility. Flexibility is in the choice of software and languages running on AWS, in the tools and interfaces available to manipulate resources and applications, and in the ability to leverage services from other providers. One of our customers I met last week was talking about his application and how it runs on AWS: he collects geo-location data, analyzes and crunches this data using Elastic Map Reduce, stores the data for quick access in DynamoDB, runs his user interface on Heroku and his web services layer for mobile devices on Elastic Beanstalk. This application is a great way to highlight how developers might leverage different services, abstractions, and tools to deliver the most value to their customers.
If you’re seeking ultimate flexibility, AWS allows you to interact with services such as Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3) directly and to piece these services together in a building block fashion. This might incur some initial groundwork, especially if you just want to deploy a simple application. AWS CloudFormation can help bring the building blocks together through its template mechanism. This simplifies the provisioning and updates, but you’re still responsible for the operational aspects of running your application.
If you don’t need control over the software stack, you can use development platforms such as AppFog, Engine Yard, and Heroku to help you manage, deploy, and monitor your applications on AWS more easily. We’ve seen some newcomers in this space over the last year such as Stackato and NodeJitsu, and each platform continues to add value through highly curated software stacks and a set of management automation.
AWS Elastic Beanstalk is another abstraction on top of the core AWS building blocks. It takes a different approach than most other development platforms by exposing the underlying resources. This approach provides the simplicity to quickly get started for application developers, but it also allows them to modify the stack to meet their goals. For example, one customer needed extensive Apache rewrite rules and a few other mods to meet his security requirements. He simply created a new AMI to use as his base for his Elastic Beanstalk container. Another pattern I have seen is customers attaching a debugger to the JVM running in their EC2 instance so that they can debug particular interaction patterns between their code and the JVM.
So is there a “one-size-fits-all” in the development platform space? No, each platform fits the needs of different developers, applications, and use cases. Preference and familiarity also play a role in why some developers choose one over the other. Ultimately, we want developers to successfully run and manage reliable, highly scalable applications on AWS, irrespective of the abstraction that their development platform of choice offers.
We will continue to work closely together with all current and future platform partners. Based on their feedback, we will develop new features and services that can help them be more successful by allowing them to focus on their customers instead of the infrastructure on which they run. This will also make it easier for new platforms to be developed such that developers will have more choice and flexibility, and they can really find the exact tools that make them most productive. AWS Elastic Beanstalk can play an important role there, too, because it is a good base for building new platforms. We are looking forward to seeing a thousand platforms bloom.
AWS Elastic Beanstalk now supports PHP applications (in addition to Java) and the ability to deploy through the popular Git version control system. To get started using PHP and Git on AWS Elastic Beanstalk, visit Deploying PHP Applications Using Git in the AWS Elastic Beanstalk Developer Guide. More details about the release at the AWS developer blog.
“The fastest HTTP request is the one not made.”
I always smile when I hear a web performance speaker say this. I forget who said it first, but I’ve heard it numerous times at conferences and meetups over the past few years. It’s true! Caching is critical for making web pages faster. I’ve written extensively about caching:
- Call to improve browser caching
- (lack of) Caching for iPhone Home Screen Apps
- Redirect caching deep dive
- Mobile cache file sizes
- Improving app cache
- Storager case study: Bing, Google
- App cache & localStorage survey
- HTTP Archive: max-age
Things are getting better – but not quickly enough. The chart below from the HTTP Archive shows that the percentage of resources that are cacheable has increased 10% during the past year (from 42% to 46%). Over that same time the number of requests per page has increased 12% and total transfer size has increased 24% (chart).
Perhaps it’s hard to make progress on caching because the problem doesn’t belong to a single group – responsibility spans website owners, third party content providers, and browser developers. One thing is certain – we have to do a better job when it comes to caching.
I’ve gathered some compelling statistics over the past few weeks that illuminate problems with caching and point to some next steps. Here are the highlights:
- 55% of resources don’t specify a max-age value
- 46% of the resources without any max-age remained unchanged over a 2 week period
- some of the most popular resources on the Web are only cacheable for an hour or two
- 40-60% of daily users to your site don’t have your resources in their cache
- 30% of users have a full cache
- for users with a full cache, the median time to fill their cache is 4 hours of active browsing
Read on to understand the full story.
My kingdom for a max-age header
Many of the caching articles I’ve written address issues such as size & space limitations, bugs with less common HTTP headers, and outdated purging logic. These are critical areas to focus on. But the basic function of caching hinges on websites specifying caching headers for their resources. This is typically done using max-age in the Cache-Control response header. This example specifies that a response can be read from cache for 1 year:
Cache-Control: max-age=31536000
Since you’re reading this blog post you probably already use max-age, but the following chart from the HTTP Archive shows that 55% of resources don’t specify a max-age value. This translates to 45 of the average website’s 81 resources needing an HTTP request even for repeat visits.
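For anyone serving resources through PHP, setting the header takes a few lines. A minimal sketch; the helper name cache_control_max_age() is hypothetical and one year is just the lifetime used in the example header.

```php
<?php
// Hypothetical helper: build a Cache-Control header from a lifetime in days.
function cache_control_max_age($days)
{
    return 'Cache-Control: max-age=' . ($days * 24 * 60 * 60);
}

$header = cache_control_max_age(365);
echo $header, PHP_EOL;

// In a real response this would be sent before any output:
// header($header);
```

365 days works out to the same 31536000 seconds shown in the header example.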
Missing max-age != dynamic
Why do 55% of resources have no caching information? Having looked at caching headers across thousands of websites my first guess is lack of awareness – many website owners simply don’t know about the benefits of caching. An alternative explanation might be that many resources are dynamic (JSON, ads, beacons, etc.) and shouldn’t be cached. Which is the bigger cause – lack of awareness or dynamic resources? Luckily we can quantify the dynamicness of these uncacheable resources using data from the HTTP Archive.
The HTTP Archive analyzes the world’s top ~50K web pages on the 1st and 15th of the month and records the HTTP headers for every resource. Using this history it’s possible to go back in time and quantify how many of today’s resources without any max-age value were identical in previous crawls. The data for the chart above (showing 55% of resources with no max-age) was gathered on Feb 15 2012. The chart below shows the percentage of those uncacheable resources that were identical in the previous crawl on Feb 1 2012. We can go back even further and see how many were identical in both the Feb 1 2012 and the Jan 15 2012 crawls. (The HTTP Archive doesn’t save response bodies so the determination of “identical” is based on the resource having the exact same URL, Last-Modified, ETag, and Content-Length.)
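The "identical" test described in the parentheses can be sketched as a field by field comparison of two crawl records. The array keys and sample values below are hypothetical; this is not the HTTP Archive's actual schema or code.

```php
<?php
// Two crawl records count as the same resource when URL, Last-Modified,
// ETag and Content-Length all match. Field names are hypothetical.
function same_resource(array $a, array $b)
{
    foreach (array('url', 'last_modified', 'etag', 'content_length') as $field) {
        if ($a[$field] !== $b[$field]) {
            return false;
        }
    }
    return true;
}

$feb1 = array(
    'url'            => 'http://www.example.com/logo.png',
    'last_modified'  => 'Tue, 10 Jan 2012 08:00:00 GMT',
    'etag'           => '"abc123"',
    'content_length' => '5120',
);
$feb15 = $feb1;                      // unchanged between the two crawls
$changed = $feb1;
$changed['etag'] = '"def456"';       // a new ETag means a different resource

var_dump(same_resource($feb1, $feb15));
var_dump(same_resource($feb1, $changed));
```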

46% of the resources without any max-age remained unchanged over a 2 week period. This works out to 21 resources per page that could have been read from cache without any HTTP request but weren’t. Over a 1 month period 38% are unchanged – 17 resources per page.
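The per-page figures follow from the percentages quoted above. A quick check, assuming each step is rounded to the nearest whole resource:

```php
<?php
$resourcesPerPage = 81;

// 55% of resources have no max-age.
$withoutMaxAge = (int) round($resourcesPerPage * 0.55);

// Of those, 46% were unchanged after 2 weeks and 38% after 1 month.
$unchangedTwoWeeks = (int) round($withoutMaxAge * 0.46);
$unchangedOneMonth = (int) round($withoutMaxAge * 0.38);

var_dump($withoutMaxAge, $unchangedTwoWeeks, $unchangedOneMonth);
```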
This is a significant missed opportunity. Here are some popular websites and the number of resources that were unchanged for 1 month but did not specify max-age:
- http://www.toyota.jp/ – 172 resources without max-age & unchanged for 1 month
- http://www.sfgate.com/ – 133
- http://www.hasbro.com/ – 122
- http://www.rakuten.co.jp/ – 113
- http://www.ieee.org/ – 97
- http://www.elmundo.es/ – 80
- http://www.nih.gov/ – 76
- http://www.frys.com/ – 68
- http://www.foodnetwork.com/ – 66
- http://www.irs.gov/ – 58
- http://www.ca.gov/ – 53
- http://www.oracle.com/ – 52
- http://www.blackberry.com/ – 50
Recalling that “the fastest HTTP request is the one not made”, this is a lot of unnecessary HTTP traffic. I can’t prove it, but I strongly believe this is not intentional – it’s just a lack of awareness. The chart below reinforces this belief – it shows the percentage of resources (both cacheable and uncacheable) that remain unchanged starting from Feb 15 2012 and going back for one year.
The percentage of resources that are unchanged is nearly the same when looking at all resources as it is for only uncacheable resources: 44% vs. 46% going back 2 weeks and 35% vs. 38% going back 1 month. Given this similarity in “dynamicness” it’s likely that the absence of max-age has nothing to do with the resources themselves and is instead caused by website owners overlooking this best practice.
3rd party content
If a website owner doesn’t make their resources cacheable, they’re just hurting themselves (and their users). But if a 3rd party content provider doesn’t have good caching behavior it impacts all the websites that embed that content. This is both bad and good. It’s bad in that one uncacheable 3rd party resource can impact multiple sites. The good part is that shifting 3rd party content to adopt good caching practices also has a magnified effect.
So how are we doing when it comes to caching 3rd party content? Below is a list of the top 30 most-used resources according to the HTTP Archive. These are the resources that were used the most across the world’s top 50K web pages. The max-age value (in hours) is also shown.
1. http://www.google-analytics.com/ga.js (2 hours)
2. http://ssl.gstatic.com/s2/oz/images/stars/po/Publisher/sprite2.png (8760 hours)
3. http://pagead2.googlesyndication.com/pagead/js/r20120208/r20110914/show_ads_impl.js (336 hours)
4. http://pagead2.googlesyndication.com/pagead/render_ads.js (336 hours)
5. http://pagead2.googlesyndication.com/pagead/show_ads.js (1 hour)
6. https://apis.google.com/_/apps-static/_/js/gapi/gcm_ppb,googleapis_client,plusone/[...] (720 hours)
7. http://pagead2.googlesyndication.com/pagead/osd.js (24 hours)
8. http://pagead2.googlesyndication.com/pagead/expansion_embed.js (24 hours)
9. https://apis.google.com/js/plusone.js (1 hour)
10. http://googleads.g.doubleclick.net/pagead/drt/s?safe=on (1 hour)
11. http://static.ak.fbcdn.net/rsrc.php/v1/y7/r/ql9vukDCc4R.png (3825 hours)
12. http://connect.facebook.net/rsrc.php/v1/yQ/r/f3KaqM7xIBg.swf (164 hours)
13. https://ssl.gstatic.com/s2/oz/images/stars/po/Publisher/sprite2.png (8760 hours)
14. https://apis.google.com/_/apps-static/_/js/gapi/googleapis_client,iframes_styles[...] (720 hours)
15. http://static.ak.fbcdn.net/rsrc.php/v1/yv/r/ZSM9MGjuEiO.js (8742 hours)
16. http://static.ak.fbcdn.net/rsrc.php/v1/yx/r/qP7Pvs6bhpP.js (8699 hours)
17. https://plusone.google.com/_/apps-static/_/ss/plusone/[...] (720 hours)
18. http://b.scorecardresearch.com/beacon.js (336 hours)
19. http://static.ak.fbcdn.net/rsrc.php/v1/yx/r/lP_Rtwh3P-S.css (8710 hours)
20. http://static.ak.fbcdn.net/rsrc.php/v1/yA/r/TSn6F7aukNQ.js (8760 hours)
21. http://static.ak.fbcdn.net/rsrc.php/v1/yk/r/Wm4bpxemaRU.js (8702 hours)
22. http://static.ak.fbcdn.net/rsrc.php/v1/yZ/r/TtnIy6IhDUq.js (8699 hours)
23. http://static.ak.fbcdn.net/rsrc.php/v1/yy/r/0wf7ewMoKC2.css (8699 hours)
24. http://static.ak.fbcdn.net/rsrc.php/v1/yO/r/H0ip1JFN_jB.js (8760 hours)
25. http://platform.twitter.com/widgets/hub.1329256447.html (87659 hours)
26. http://static.ak.fbcdn.net/rsrc.php/v1/yv/r/T9SYP2crSuG.png (8699 hours)
27. http://platform.twitter.com/widgets.js (1 hour)
28. https://plusone.google.com/_/apps-static/_/js/plusone/[...] (720 hours)
29. http://pagead2.googlesyndication.com/pagead/js/graphics.js (24 hours)
30. http://s0.2mdn.net/879366/flashwrite_1_2.js (720 hours)
There are some interesting patterns.
- simple URLs have short cache times – Some resources have very short cache times, e.g., ga.js (1), show_ads.js (5), and twitter.com/widgets.js (27). Most of the URLs for these resources are very simple (no querystring or URL “fingerprints”) because these resource URLs are part of the snippet that website owners paste into their page. These “bootstrap” resources are given short cache times because there’s no way for the resource URL to be changed if there’s an emergency fix – instead the cached resource has to expire in order for the emergency update to be retrieved.
- long URLs have long cache times – Many 3rd party “bootstrap” scripts dynamically load other resources. These code-generated URLs are typically long and complicated because they contain some unique fingerprinting, e.g., http://pagead2.googlesyndication.com/pagead/js/r20120208/r20110914/show_ads_impl.js (3) and http://platform.twitter.com/widgets/hub.1329256447.html (25). If there’s an emergency change to one of these resources, the fingerprint in the bootstrap script can be modified so that a new URL is requested. Therefore, these fingerprinted resources can have long cache times because there’s no need to rev them in the case of an emergency fix.
- where’s Facebook’s like button? – Facebook’s like.php and likebox.php are also hugely popular but aren’t in this list because the URL contains a querystring that differs across every website. Those resources have an even more aggressive expiration policy compared to other bootstrap resources – they use no-cache, no-store, must-revalidate. Once the like[box] bootstrap resource is loaded, it loads the other required resources: lP_Rtwh3P-S.css (19), TSn6F7aukNQ.js (20), etc. Those resources have long URLs and long cache times because they’re generated by code, as explained in the previous bullet.
- short caching resources are often async – The fact that bootstrap scripts have short cache times is good for getting emergency updates, but is bad for performance because they generate many Conditional GET requests on subsequent requests. We all know that scripts block pages from loading, so these Conditional GET requests can have a significant impact on the user experience. Luckily, some 3rd party content providers are aware of this and offer async snippets for loading these bootstrap scripts mitigating the impact of their short cache times. This is true for ga.js (1), plusone.js (9), twitter.com/widgets.js (27), and Facebook’s like[box].php.
These extremely popular 3rd party snippets are in pretty good shape, but as we get out of the top widgets we quickly find that these good caching patterns degrade. In addition, more 3rd party providers need to support async snippets.
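The fingerprinting pattern from the "long URLs" bullet can be sketched in a few lines of Python. This is only an illustration, not how Google, Twitter, or Facebook build their URLs: the function name `fingerprinted_url`, the eight-character hash length, and the example path are all assumptions.

```python
import hashlib
import posixpath


def fingerprinted_url(path: str, content: bytes, digest_len: int = 8) -> str:
    """Return `path` with a short content hash inserted before the extension.

    The name, hash length, and URL layout are illustrative assumptions.
    """
    digest = hashlib.sha1(content).hexdigest()[:digest_len]
    root, ext = posixpath.splitext(path)
    return f"{root}.{digest}{ext}"


# A change to the content yields a new URL, so the old copy can stay
# cached for a long time without blocking an emergency fix.
v1 = fingerprinted_url("/js/widget_impl.js", b"console.log('v1');")
v2 = fingerprinted_url("/js/widget_impl.js", b"console.log('v2');")
```

The bootstrap script keeps its simple URL and short cache time, and only the URL it generates changes when the underlying resource changes.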
Cache sizes are too small
In January 2007 Tenni Theurer and I ran an experiment at Yahoo! to estimate how many users had a primed cache. The methodology was to embed a transparent 1×1 image in the page with an expiration date in the past. If users had the expired image in their cache the browser would issue a Conditional GET request and receive a 304 response (primed cache). Otherwise they’d get a 200 response (empty cache). I was surprised to see that 40-60% of daily users to the site didn’t have the site’s resources in their cache and 20% of page views were done without the site’s resources in the cache.
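The bookkeeping behind that methodology is simple enough to sketch. The snippet below assumes the beacon image's response codes have already been pulled out of the server logs; the function name and the sample list are made up for illustration and are not the Yahoo! data.

```python
from collections import Counter
from typing import Iterable


def primed_cache_rate(status_codes: Iterable[int]) -> float:
    """Fraction of beacon requests answered with 304 (primed cache).

    Only 200 (empty cache) and 304 (primed cache) responses are counted.
    """
    counts = Counter(code for code in status_codes if code in (200, 304))
    total = counts[200] + counts[304]
    if total == 0:
        raise ValueError("no 200 or 304 responses to classify")
    return counts[304] / total


# Hypothetical sample: 2 empty-cache views and 8 primed-cache views.
sample = [200, 304, 304, 304, 200, 304, 304, 304, 304, 304]
rate = primed_cache_rate(sample)
```

Grouping the same counts by user instead of by page view would give the per-user figure.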
Numerous factors contribute to this high rate of unique users missing the site’s resources in their cache, but I believe the primary reason is small cache sizes. Browsers have increased the size of their caches since this experiment was run, but not enough. It’s hard to test browser cache size. Blaze.io’s article Understanding Mobile Cache Sizes shows results from their testing. Here are the max cache sizes I found for browsers on my MacBook Air. (Some browsers set the cache size based on available disk space, so let me mention that my drive is 250 GB and has 54 GB available.) I did some testing and searching to find max cache sizes for my mobile devices and IE.
- Chrome: 320 MB
- Internet Explorer 9: 250 MB
- Firefox 11: 830 MB (shown in about:cache)
- Opera 11: 20 MB (shown in Preferences | Advanced | History)
- iPhone 4, iOS 5.1: 30-35 MB (based on testing)
- Galaxy Nexus: 18 MB (based on testing)
I’m surprised that Firefox 11 has such a large cache size – that’s almost close to what I want. All the others are (way) too small. 18-35 MB on my mobile devices?! I have seven movies on my iPhone – I’d gladly trade Iron Man 2 (1.82 GB) for more cache space.
Caching in the real world
In order to justify increasing browser cache sizes we need some statistics on how many real users overflow their cache. This topic came up at last month’s Velocity Summit where we had representatives from Chrome, Internet Explorer, Firefox, Opera, and Silk. (Safari was invited but didn’t show up.) Will Chan from the Chrome team (working on SPDY) followed up with this post on Chromium cache metrics from Windows Chrome. These are the most informative real user cache statistics I’ve ever seen. I strongly encourage you to read his article.
Some of the takeaways include:
- ~30% of users have a full cache (capped at 320 MB)
- for users with a full cache, the median time to fill their cache is 4 hours of active browsing (20 hours of clock time)
- 7% of users clear their cache at least once per week
- 19% of users experience “fatal cache corruption” at least once per week thus clearing their cache
The last stat about cache corruption is interesting – I appreciate the honesty. The IE 9 team experienced something similar. In IE 7&8 the cache was capped at 50 MB based on tests showing increasing the cache size didn’t improve the cache hit rate. They revisited this surprising result in IE9 and found that larger cache sizes actually did improve the cache hit rate:
In IE9, we took a much closer look at our cache behaviors to better understand our surprising finding that larger caches were rarely improving our hit rate. We found a number of functional problems related to what IE treats as cacheable and how the cache cleanup algorithm works. After fixing these issues, we found larger cache sizes were again resulting in better hit rates, and as a result, we’ve changed our default cache size algorithm to provide a larger default cache.
Will mentions that Chrome’s 320 MB cap should be revisited. 30% seems like a low percentage for full caches, but could be accounted for by users that aren’t very active and active users that only visit a small number of websites (for example, just Gmail and Facebook). If possible I’d like to see these full cache statistics correlated with activity. It’s likely that users who account for the biggest percentage of web visits are more likely to have a full cache, and thus experience slower page load times.
Next steps
First, much of the data for this post came from the HTTP Archive, so I’d like to thank our sponsors: Google, Mozilla, New Relic, O’Reilly Media, Etsy, Strangeloop, dynaTrace Software, and Torbit.
The data presented here suggest a few areas to focus on:
Website owners need to increase their use of a Cache-Control max-age, and the max-age times need to be longer. 38% of resources were unchanged over a 1 month period, and yet only 11% of resources have a max-age value that high. Most resources, even if they change, can be refreshed by including a fingerprint in the URL specified in the HTML document. Only bootstrap scripts from 3rd parties should have short cache times (hours). Truly dynamic responses (JSON, etc.) should specify must-revalidate. A year from now rather than seeing 55% of resources without any max-age value we should see 55% cacheable for a month or more.
3rd party content providers need wider adoption of the caching and async behavior shown by the top Google, Twitter, and Facebook snippets.
Browser developers stand to bring the biggest improvements to caching. Increasing cache sizes is a likely win, especially for mobile devices. Data correlating cache sizes and user activity is needed. More intelligence around purging algorithms, such as IE 9′s prioritization based on mime type, will help when the cache fills up. More focus on personalization (what are the sites I visit most often?) would also create a faster user experience when users go to their favorite websites.
It’s great that the number of resources with caching headers grew 10% over the last year, but that just isn’t enough progress. We should really expect to double the number of resources that can be read from cache over the coming year. Just think about all those HTTP requests that can be avoided!
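The recommendation for website owners can be written down as a small policy function. This is a rough sketch: the three category names, the one-hour and one-year lifetimes, and the choice to pair must-revalidate with max-age=0 (so that a cached copy is always revalidated) are my assumptions, not values measured from the HTTP Archive.

```python
ONE_HOUR = 3600
ONE_YEAR = 365 * 24 * ONE_HOUR


def cache_control(kind: str) -> str:
    """Pick a Cache-Control value for a resource category.

    The categories and exact lifetimes are illustrative assumptions.
    """
    if kind == "fingerprinted":   # URL changes whenever the content changes
        return f"max-age={ONE_YEAR}"
    if kind == "bootstrap":       # fixed URL pasted into third-party pages
        return f"max-age={ONE_HOUR}"
    if kind == "dynamic":         # JSON and other per-request responses
        return "max-age=0, must-revalidate"
    raise ValueError(f"unknown resource kind: {kind!r}")
```

For example, `cache_control("fingerprinted")` yields a one-year max-age, while a bootstrap script gets an hour.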
I've just finished presenting my talk on how I currently work on Puppet modules at Puppetcamp here in Edinburgh, where I've been for the week speaking at both FlossUK 2012 and Puppetcamp.
Earlier this week I opened FlossUK 2012 with my talk on 7 tools for your devops stack.
One thing that I've wanted to implement for a while now is automatic vhosts on my dev box. The idea is that I want to drop a folder into a directory and have it automatically turned into a vhost for me accessible at http://foldername.dev. It turns out that this isn't nearly as hard as expected which is usually the case with things that I've been putting off!
This is how to do it.
Apache configuration
The Apache magic is in an extension called mod_vhost_alias which you may need to enable in your httpd.conf file.
You can then set up the VirtualHost wherever you keep such things. On a stock OS X, the extras/httpd-vhosts.conf file is used.
Add the following to the bottom:
<Virtualhost *:80>
VirtualDocumentRoot "/www/dev/%1/public"
ServerName vhosts.dev
ServerAlias *.dev
UseCanonicalName Off
LogFormat "%V %h %l %u %t \"%r\" %s %b" vcommon
ErrorLog "/www/dev/vhosts-error_log"
<Directory "/www/dev/*">
Options Indexes FollowSymLinks MultiViews
AllowOverride All
Order allow,deny
Allow from all
</Directory>
</Virtualhost>
In the VirtualHost configuration, I have used the ServerAlias and VirtualDocumentRoot directives to map http://foldername.dev to the directory /www/dev/foldername/public. Hence, any folder that I place in /www/dev will have its own virtual host. Alter these appropriately for your set-up.
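To make the mapping concrete, here is a simplified Python model of what the %1 in VirtualDocumentRoot does with the host name. It only mimics the substitution for this one template (mod_vhost_alias supports other specifiers too), and the function name is hypothetical.

```python
def virtual_document_root(host: str, template: str = "/www/dev/%1/public") -> str:
    """Expand %1 in `template` with the first dot-separated part of `host`.

    Simplified model of the VirtualDocumentRoot substitution used above.
    """
    first_part = host.split(".")[0]
    return template.replace("%1", first_part)


root = virtual_document_root("foldername.dev")
```

So a request for http://foldername.dev is served from /www/dev/foldername/public, and http://test.dev from /www/dev/test/public.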
Don't forget to restart Apache.
Unfortunately, the computer hasn't a clue how to handle http://foldername.dev and the obvious solution is to run a local DNS server. Another solution is to use a PAC file.
DNS server configuration
This is easy enough with dnsmasq. On OS X, use Homebrew to install like this: brew install dnsmasq. On Linux, use your package manager; on Windows, you're on your own!
Note that on OS X, you should set it to start up automatically using launchd as noted in the instructions after installation. You also need to copy the configuration file to /usr/local/etc using: cp /usr/local/Cellar/dnsmasq/2.57/dnsmasq.conf.example /usr/local/etc/dnsmasq.conf (or whatever the latest version number is). On Linux, I would guess that your package manager provides a dnsmasq.conf file in /etc or /etc/dnsmasq.
Next, edit the dnsmasq.conf file and add the following lines to the bottom:
listen-address=127.0.0.1
address=/.dev/127.0.0.1
Add the name server to your network configuration
On OS X, go to System Preferences -> Network -> {Wifi or Ethernet} -> Advanced… -> DNS, click on the + button at the bottom of the left hand panel and add 127.0.0.1 to the list of DNS servers. Drag 127.0.0.1 to the top of the list.
On Linux, you should use the appropriate GUI tools for your distribution or potentially edit /etc/dhcp/dhclient.conf and uncomment the domain-name-servers 127.0.0.1; line (line 20 on Ubuntu).
Restart dnsmasq and you should now be able to execute host test.dev on the command line and see 127.0.0.1 as the resultant address.
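If you would rather script that check, a small Python helper can do the same lookup. The helper name is made up, and the resolver is passed in as a parameter only so that the function can be exercised without a working dnsmasq; by default it uses the system resolver.

```python
import socket
from typing import Callable


def resolves_to_loopback(
    name: str,
    resolver: Callable[[str], str] = socket.gethostbyname,
) -> bool:
    """Return True if `name` resolves to 127.0.0.1."""
    try:
        return resolver(name) == "127.0.0.1"
    except OSError:  # socket.gaierror is a subclass of OSError
        return False
```

Once dnsmasq is running and listed as a name server, `resolves_to_loopback("test.dev")` should return True.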
Alternative to DNS server: PAC file
Since publishing this article, Chris Morell pointed out that you can also use PAC files rather than install a DNS server. Details are on his blog post.
Check it works
Create a directory called test in your dev directory. Within test, create public/index.php and within index.php add some code to prove it works, e.g. <?php echo "Hello World"; ?>
If you navigate to http://test.dev, you should see "Hello World" displayed.
Caveats
A couple of caveats:
- DOCUMENT_ROOT is not /www/dev/test as you'd expect. Instead it is the global document root. See this gist for a neat way to solve this using a prepend file.
- If you use mod_rewrite, then you'll need a RewriteBase / in your .htaccess file. Alternatively, you can change the Directory section of your vhost to do the rewriting for you if all your projects are alike. Something like this should work:
<Directory "/www/dev/*">
Options Indexes FollowSymLinks MultiViews
AllowOverride None
Order allow,deny
Allow from all
RewriteEngine On
RewriteBase /
RewriteCond %{REQUEST_FILENAME} -s [OR]
RewriteCond %{REQUEST_FILENAME} -l [OR]
RewriteCond %{REQUEST_FILENAME} -d
RewriteRule ^.*$ - [NC,L]
RewriteRule ^.*$ index.php [NC,L]
</Directory>
All done
That's it. You can now create as many projects as you like without having to worry about setting up new virtual hosts or modifying your hosts file!
Why do we have to bother about built-in GTID support in MySQL 5.6 at all? Sure, it is a tremendous step forward for a lazy primary copy system like MySQL Replication. Period. GTIDs make server-side failover easier (slides). And load balancers, including PECL/mysqlnd_ms as an example of a driver-integrated load balancer, can use them to provide session consistency. Please, see the slides. But…
… the primary remains a single point of failure. GTIDs can be described as cluster-wide transaction counters generated on the master. In case of a master failure, the slave that has replicated the highest transaction counter shall be promoted to become the master. It's the most current slave. Failover made easy - no doubt! Adequately deployed, you should reach very reasonable availability.
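The promotion rule fits in a few lines of Python. This sketch uses the same simplified "transaction counter" picture as the paragraph above (one integer per slave); the function name and the sample cluster are hypothetical.

```python
from typing import Dict


def pick_new_master(replicated_counters: Dict[str, int]) -> str:
    """Return the slave that has replicated the highest transaction counter."""
    if not replicated_counters:
        raise ValueError("no slaves available for promotion")
    return max(replicated_counters, key=replicated_counters.get)


# Hypothetical cluster state after the master has failed.
slaves = {"slave-a": 1041, "slave-b": 1047, "slave-c": 1039}
candidate = pick_new_master(slaves)
```

Here slave-b has replicated the most transactions, so it is the one to promote.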
Know the limits of replicated systems
A multi-master (update anywhere) design does not have a single point of failure. But among the biggest challenges is scaling a multi-master solution. Jim Gray and Pat Helland concluded in 1996 in "The Dangers of Replication and a Solution": "Update anywhere-anytime-anyway transactional replication has unstable behavior as the workload scales up: a ten-fold increase in nodes and traffic gives a thousand fold increase in deadlocks or reconciliations." N^3 - buuuhhhh, anything worse than linear scale is not appreciated. Guess what: Microsoft SQL Azure is using primary copy combined with partitioning.
In practice things are not that bad, particularly not for a small number of nodes and recent algorithms. For example, MySQL Cluster (related webinar on March 29) is a true multi-master solution - even eager/synchronous. To overcome the write-scale limitations it has built-in partitioning (sharding). The two classical scale-out solutions - replication and partitioning - are combined in one product. If you want extreme performance and are ready to pay for the costs of partitioning… try it.
Anything to learn from the NoSQL kids on the block?
Some other kids offer relaxed eventual consistency just as MySQL Replication does. Sometimes the CAP theorem is cited as an excuse for it. Some leave conflict resolution, even conflict detection, to the application developer. A massively scalable, highly available, synchronous update anywhere solution with built-in conflict resolution - the big thing we all dream of - is hard to create.
In the meanwhile… - maybe cluster-aware APIs
While we all wait for the one-fits-all solution, there is something we can do. We can start to tell our load balancers precisely what we need and request no higher level of service than needed. Consistency - as in CAP - is one aspect of service quality. We should start to have cluster-aware APIs abstracting the details of replication architectures. Then, our load balancers, including PECL/mysqlnd_ms, can hide everything that makes working with a cluster complicated (connection pooling, request splitting and redirection, failover, node selection, load distribution, …). Also, vendors can start to play with consistency to improve performance without messing up application logic.
Below is how you use the PECL/mysqlnd_ms 1.2+ function mysqlnd_ms_set_qos() to switch between eventual consistency (stale data allowed) and session consistency (read-your-writes). MySQL Replication details are hidden behind a function call.
$mysqli = new mysqli("myapp", "username", "password", "database");
/* new mysqli() always returns an object, so check connect_errno */
if ($mysqli->connect_errno)
  /* Of course, your error handling is nicer... */
  die(sprintf("[%d] %s\n", mysqli_connect_errno(), mysqli_connect_error()));
/* read-write splitting: master used */
if (!$mysqli->query("INSERT INTO orders(order_id, item) VALUES (1, 'christmas tree, 1.8m')")) {
  /* Please use better error handling in your code */
  die(sprintf("[%d] %s\n", $mysqli->errno, $mysqli->error));
}
/* Request session consistency: read your writes */
if (!mysqlnd_ms_set_qos($mysqli, MYSQLND_MS_QOS_CONSISTENCY_SESSION))
  die(sprintf("[%d] %s\n", $mysqli->errno, $mysqli->error));
/* Plugin picks a node which has the changes, here: master */
if (!$res = $mysqli->query("SELECT item FROM orders WHERE order_id = 1"))
  die(sprintf("[%d] %s\n", $mysqli->errno, $mysqli->error));
var_dump($res->fetch_assoc());
/* Back to eventual consistency: stale data allowed */
if (!mysqlnd_ms_set_qos($mysqli, MYSQLND_MS_QOS_CONSISTENCY_EVENTUAL))
  die(sprintf("[%d] %s\n", $mysqli->errno, $mysqli->error));
/* Plugin picks any slave, stale data is allowed */
if (!$res = $mysqli->query("SELECT item, price FROM specials"))
  die(sprintf("[%d] %s\n", $mysqli->errno, $mysqli->error));
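The node choice that the comments in this example describe can be modelled in a few lines of Python. To be clear, this is not the PECL/mysqlnd_ms algorithm, just a sketch of the idea under simplifying assumptions: each slave reports a single replication position, the client remembers the position of its last write, and all names are invented for the illustration.

```python
from typing import Dict, List


def candidate_nodes(
    qos: str,
    master: str,
    slave_positions: Dict[str, int],
    last_write: int,
) -> List[str]:
    """Return nodes allowed to serve a read under the requested consistency.

    `qos` is "session" (read-your-writes) or "eventual" (stale data allowed).
    """
    if qos == "eventual":
        return sorted(slave_positions)  # any slave may answer
    if qos == "session":
        caught_up = sorted(
            name for name, pos in slave_positions.items() if pos >= last_write
        )
        return caught_up or [master]  # no slave has the write yet: use master
    raise ValueError(f"unknown consistency level: {qos!r}")
```

The point of a cluster-aware API is that the application only states the consistency level it needs and leaves this kind of selection to the load balancer.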
GTID for clients? Buzz alarm!
PECL/mysqlnd_ms 1.3 does not bring any ground breaking changes with regards to consistency or GTIDs. It can now either use the driver built-in GTID emulation (1.2+) or the server-side GTID feature (1.3+, MySQL 5.6) for session consistency. That’s all. I confess, the slide title is pure buzz. But in every tale is some truth.
Cluster-aware APIs and better load balancer? Follow up!
I’m convinced that good load balancers can make application developers’ lives much easier. Read-your-writes and session consistency are an example of how new API calls may come in handy. Transparently replacing remote slave accesses with client-side cache accesses (coming with 1.3) is an example of how load balancers can optimize overall cluster performance.
Whoever designs a replication solution in 2012 should include the load balancer in their considerations… - even for multi-master.
Happy hacking!
My previous blog post, Cache them if you can, suggests that current cache sizes are too small – especially on mobile.
Given this concern about cache size a relevant question is:
If a response is compressed, does the browser save it compressed or uncompressed?
Compression typically reduces responses by 70%. This means that a browser can cache 3x as many compressed responses if they’re saved in their compressed format.
Note that not all responses are compressed. Images make up the largest number of resources but shouldn’t be compressed. On the other hand, HTML documents, scripts, and stylesheets should be compressed and account for 30% of all requests. Being able to save 3x as many of these responses to cache could have a significant impact on cache hit rates.
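The "3x" figure is just the reciprocal of the fraction that remains after compression. A quick Python check, where the function name is mine and the 70% and 25% savings are the figures used in this post:

```python
def cache_capacity_multiplier(compression_savings: float) -> float:
    """How many times more responses fit if they are stored compressed."""
    if not 0 <= compression_savings < 1:
        raise ValueError("savings must be in the range [0, 1)")
    return 1 / (1 - compression_savings)


typical = cache_capacity_multiplier(0.70)      # a little over 3x
random_text = cache_capacity_multiplier(0.25)  # roughly 1.3x
```

A response that shrinks by 70% takes 30% of the space, so a little more than three of them fit where one uncompressed copy would.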
It’s difficult and time-consuming to determine whether compressed responses are saved in compressed format. I created this Caching Gzip Test page to help determine browser behavior. It has two 200 KB scripts – one is compressed down to ~148 KB and the other is uncompressed. (Note that this file is random strings so the compression savings is only 25% as compared to the typical 70%.) After clearing the cache and loading the test page if the total cache disk size increases ~348 KB it means the browser saves compressed responses as compressed. If the total cache disk size increases ~400 KB it means compressed responses are saved uncompressed.
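The decision rule for the test page can be spelled out in Python. The 148 KB and 200 KB sizes come from the description above; the 20 KB tolerance and the function name are assumptions I picked for the sketch, since real cache directories also grow by some metadata.

```python
def interpret_cache_growth(growth_kb: float, tolerance_kb: float = 20) -> str:
    """Classify the cache growth measured after loading the test page."""
    compressed_kb, uncompressed_kb = 148, 200  # sizes from the test page
    stored_compressed = compressed_kb + uncompressed_kb      # ~348 KB
    stored_uncompressed = uncompressed_kb + uncompressed_kb  # ~400 KB
    if abs(growth_kb - stored_compressed) <= tolerance_kb:
        return "compressed"
    if abs(growth_kb - stored_uncompressed) <= tolerance_kb:
        return "uncompressed"
    return "inconclusive"
```

Anything that falls outside both windows is reported as inconclusive instead of being forced into one of the two answers.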
The challenging part of this experiment is finding where the cache is stored and measuring the response sizes. Firefox, Chrome, and Opera save responses as files and were easy to measure. For IE on Windows I wasn’t able to access the individual cache files (admin permissions?) but was able to measure the sizes based on the properties of the Temporary Internet Files folder. Safari saves all responses in Cache.db. I was able to see the incremental increase by modifying the experiment to be two pages: the compressed response and the uncompressed response. You can see the cache file locations and full details in the Caching Gzip Test Results page.
Here are the results for top desktop browsers:
| Browser | Compressed responses cached compressed? | Max cache size |
|---|---|---|
| Chrome 17 | yes | 320 MB* |
| Firefox 11 | yes | 830 MB* |
| IE 8 | no | 50 MB |
| IE 9 | no | 250 MB |
| Safari 5.1.2 | no | unknown |
| Opera 11 | yes | 20 MB |
* Chrome and Firefox cache size is a percentage of available disk space. Chrome is capped at 320 MB. I don’t know what Firefox’s cap is; on my laptop with 50 GB free the cache size is 830 MB.
We see that Chrome 17, Firefox 11, and Opera 11 store compressed responses in compressed format, while IE 8&9 and Safari 5 save them uncompressed. IE 8&9 have smaller cache sizes, so the fact that they uncompress responses before caching further reduces the number of responses that can be cached.
What’s the best choice? It’s possible that reading cached responses is faster if they’re already uncompressed. That would be a good next step to explore. I wouldn’t prejudge IE’s choice when it comes to performance on Windows. But it’s clear that saving compressed responses in compressed format increases the number of responses that can be cached, and this increases cache hit rates. What’s even clearer is that browsers don’t agree on the best answer. Should they?
This past December I contributed an article called Frontend SPOF in Beijing to PerfPlanet’s Performance Calendar. I hope that everyone who reads my blog also reads the Performance Calendar – it’s an amazing collection of web performance articles and gurus. But in case you don’t I’m cross-posting it here. I saw a great presentation from Pat Meenan about frontend SPOF and want to raise awareness around this issue. This post contains some good insights.
Make sure to read PerfPlanet – it’s a great aggregator of WPO blog posts.
Now – flash back to December 2011…
I’m at Velocity China in Beijing as I write this article for the Performance Calendar. Since this is my second time to Beijing I was better prepared for the challenges of being behind the Great Firewall. I knew I couldn’t access popular US websites like Google, Facebook, and Twitter, but as I did my typical surfing I was surprised at how many other websites seemed to be blocked.
Business Insider
It didn’t take me long to realize the problem was frontend SPOF – when a frontend resource (script, stylesheet, or font file) causes a page to be unusable. Some pages were completely blank, such as Business Insider:
Firebug’s Net Panel shows that anywhere.js is taking a long time to download because it’s coming from platform.twitter.com – which is blocked by the firewall. Knowing that scripts block rendering of all subsequent DOM elements, we form the hypothesis that anywhere.js is being loaded in blocking mode in the HEAD. Looking at the HTML source we see that’s exactly what is happening:
<head>
...
<!-- Twitter Anywhere -->
<script src="https://platform.twitter.com/anywhere.js?id=ZV0...&v=1" type="text/javascript"></script>
<!-- / Twitter Anywhere -->
...
</head>
...
<body>
If anywhere.js had been loaded asynchronously this wouldn’t happen. Instead, since anywhere.js is loaded the old way with <SCRIPT SRC=..., it blocks all the DOM elements that follow which in this case is the entire BODY of the page. If we wait long enough the request for anywhere.js times out and the page begins to render. How long does it take for the request to time out? Looking at the “after” screenshot of Business Insider we see it takes 1 minute and 15 seconds for the request to time out. That’s 1 minute and 15 seconds that the user is left staring at a blank white screen waiting for the Twitter script!
CNET
CNET has a slightly different experience; the navigation header is displayed but the rest of the page is blocked from rendering:
Looking in Firebug we see that wrapper.js from cdn.eyewonder.com is “pending” – this must be another domain that’s blocked by the firewall. Based on where the rendering stops our guess is that the wrapper.js SCRIPT tag is immediately after the navigation header and is loaded in blocking mode thus preventing the rest of the page from rendering. The HTML confirms that this is indeed what’s happening:
<header>
...
</header>
<script src="http://cdn.eyewonder.com/100125/771933/1592365/wrapper.js"></script>
<div id="rb_wrap">
<div id="rb_content">
<div id="contentMain">
O’Reilly Radar
Every day I visit O’Reilly Radar to read Nat Torkington’s Four Short Links. Normally Nat’s is one of many stories on the Radar front page, but going there from Beijing shows a page with only one story:
At the bottom of this first story there’s supposed to be a Tweet button. This button is added by the widgets.js script fetched from platform.twitter.com which is blocked by the Great Firewall. This wouldn’t be an issue if widgets.js was fetched asynchronously, but sadly a peek at the HTML shows that’s not the case:
<a href="http://www.stevesouders.com/blog...">Comment</a> |
<span class="social-counters">
<span class="retweet">
<a href="http://twitter.com/share" class="twitter-share-button" data-count="horizontal" data-url="http://radar.oreilly.com/2011/12/four-short-links-6-december-20-1.html" data-text="Four short links: 6 December 2011" data-via="radar" data-related="oreillymedia:oreilly.com">Tweet</a>
<script src="http://platform.twitter.com/widgets.js" type="text/javascript"></script>
</span>
The cause of frontend SPOF
One possible takeaway from these examples might be that frontend SPOF is specific to Twitter and eyewonder and a few other 3rd party widgets. Sadly, frontend SPOF can be caused by any 3rd party widget, and even from the main website’s own scripts, stylesheets, or font files.
Another possible takeaway from these examples might be to avoid 3rd party widgets that are blocked by the Great Firewall. But the Great Firewall isn’t the only cause of frontend SPOF – it just makes it easier to reproduce. Any script, stylesheet, or font file that takes a long time to return has the potential to cause frontend SPOF. This typically happens when there’s an outage or some other type of failure, such as an overloaded server where the HTTP request languishes in the server’s queue for so long the browser times out.
The true cause of frontend SPOF is loading a script, stylesheet, or font file in a blocking manner. The table in my frontend SPOF blog post shows when this happens. It’s really the website owner who controls whether or not their site is vulnerable to frontend SPOF. So what’s a website owner to do?
Avoiding frontend SPOF
The best way to avoid frontend SPOF is to load scripts asynchronously. Many popular 3rd party widgets do this by default, such as Google Analytics, Facebook, and Meebo. Twitter also has an async snippet for the Tweet button that O’Reilly Radar should use. If the widgets you use don’t offer an async version you can try Stoyan’s Social button BFFs async pattern.
Another solution is to wrap your widgets in an iframe. This isn’t always possible, but in two of the examples above the widget is eventually served in an iframe. Putting them in an iframe from the start would have avoided the frontend SPOF problems.
For the sake of brevity I’ve focused on solutions for scripts. Solutions for font files can be found in my @font-face and performance blog post. I’m not aware of much research on loading stylesheets asynchronously. Causing too many reflows and FOUC are concerns that need to be addressed.
Call to action
Business Insider, CNET, and O’Reilly Radar all have visitors from China, and yet the way their pages are constructed delivers a bad user experience where most if not all of the page is blocked for more than a minute. This isn’t a P2 frontend JavaScript issue. This is an outage. If the backend servers for these websites took 1 minute to send back a response, you can bet the DevOps teams at Business Insider, CNET, and O’Reilly wouldn’t sleep until the problem was fixed. So why is there so little concern about frontend SPOF?
Frontend SPOF doesn’t get much attention – it definitely doesn’t get the attention it deserves given how easily it can bring down a website. One reason is it’s hard to diagnose. There are a lot of monitors that will start going off if a server response time exceeds 60 seconds. And since all that activity is on the backend it’s easier to isolate the cause. Is it that pagers don’t go off when clientside page load times exceed 60 seconds? That’s hard to believe, but perhaps that’s the case.
Perhaps it’s the way page load times are tracked. If you’re looking at worldwide medians, or even averages, and China isn’t a major audience your page load time stats might not exceed alert levels when frontend SPOF happens. Or maybe page load times are mostly tracked using synthetic testing, and those user agents aren’t subjected to real world issues like the Great Firewall.
One thing website owners can do is ignore frontend SPOF until it’s triggered by some future outage. A quick calculation shows this is a scary choice. If a 3rd party widget has a 99.99% uptime and a website has five widgets that aren’t async, the probability of frontend SPOF is 0.05%. If we drop uptime to 99.9% the probability of frontend SPOF climbs to 0.5%. Five widgets might be high, but remember that “third party widget” includes ads and metrics. Also, the website’s own resources can cause frontend SPOF which brings the number even higher. The average website today contains 14 scripts any of which could cause frontend SPOF if they’re not loaded async.
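Here is that quick calculation in Python, for anyone who wants to plug in their own numbers. It assumes the widgets fail independently of each other, and the 14-script case at 99.9% uptime is my own extension of the example, not a measured figure.

```python
def spof_probability(uptime: float, blocking_resources: int) -> float:
    """Chance that at least one blocking resource is down at a given moment.

    Assumes every resource has the same uptime and fails independently.
    """
    return 1 - uptime ** blocking_resources


p_four_nines = spof_probability(0.9999, 5)   # about 0.05%
p_three_nines = spof_probability(0.999, 5)   # about 0.5%
p_fourteen = spof_probability(0.999, 14)     # about 1.4%
```

For small failure rates the result is close to the number of blocking resources times the individual downtime, which is why every extra blocking script matters.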
Frontend SPOF is a real problem that needs more attention. Website owners should use async snippets and patterns, monitor real user page load times, and look beyond averages to 95th percentiles and standard deviations. Doing these things will mitigate the risk of subjecting users to the dreaded blank white page. A chain is only as strong as its weakest link. What’s your website’s weakest link? There’s a lot of focus on backend resiliency. I’ll wager your weakest link is on the frontend.
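A toy example shows why the median is not enough. The load times below are invented for illustration: 90 normal page views at 2 seconds and 10 views that waited an extra 75 seconds for a blocked script to time out.

```python
import statistics

# Hypothetical page load times in seconds (not real measurements).
load_times = [2.0] * 90 + [77.0] * 10

median = statistics.median(load_times)
mean = statistics.mean(load_times)
# quantiles(n=100) returns the 99 percentile cut points; index 94 is p95.
p95 = statistics.quantiles(load_times, n=100)[94]
```

With this sample the median stays at 2 seconds and looks perfectly healthy, while the 95th percentile jumps to 77 seconds. An alert keyed only to the median would never fire.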
[Originally posted as part of PerfPlanet's Performance Calendar 2011.]
The new view layer in Zend Framework 2 can be set up to return JSON rather than rendered HTML relatively easily. There are two steps to this:
Set up the JsonStrategy
Firstly we need to set up the view's JsonStrategy to check for situations where returning JSON is required and then to render out JSON for us. The JsonStrategy will cause the JsonRenderer to be run in two situations:
- The view model returned by the controller action is a JsonModel
- The HTTP Accept header sent in the Request includes "application/json"
To enable the JsonStrategy, we simply attach it to the view's event manager with a reasonably high priority. This can be done in our Application's Module class. Firstly we attach an onBootstrap() callback to the bootstrap event and then we implement onBootstrap() to attach the JsonStrategy:
module/Application/Module.php:
class Module implements AutoloaderProvider
{
    public function init(Manager $moduleManager)
    {
        // Run onBootstrap() when the application fires its bootstrap event
        $events = StaticEventManager::getInstance();
        $events->attach('bootstrap', 'bootstrap', array($this, 'onBootstrap'));
    }

    public function onBootstrap(Event $e)
    {
        $application = $e->getParam('application');
        /* @var $application \Zend\Mvc\Application */
        $locator = $application->getLocator();
        $view = $locator->get('Zend\View\View');
        $jsonStrategy = $locator->get('Zend\View\Strategy\JsonStrategy');
        // Priority 100 so the JsonStrategy is consulted early
        $view->events()->attach($jsonStrategy, 100);
    }

    // more methods such as getConfig() and getAutoloaderConfig()
}
As you can see, in init() we grab the StaticEventManager to attach our onBootstrap() method to the bootstrap event. Then, within onBootstrap(), we grab the view and the JsonStrategy from the locator (via application) and attach the JsonStrategy to the view's events() event manager.
Return a JsonModel from the controller action
To send JSON to the client when the Accept header isn't application/json, we use a JsonModel in a controller action like this:
module/Application/src/Application/Controller/IndexController.php:
namespace Application\Controller;

use Zend\Mvc\Controller\ActionController,
    Zend\View\Model\ViewModel,
    Zend\View\Model\JsonModel;

class IndexController extends ActionController
{
    public function indexAction()
    {
        $result = new JsonModel(array(
            'some_parameter' => 'some value',
            'success' => true,
        ));

        return $result;
    }
}
The output will now be JSON. Obviously, if you're sending JSON back based on the Accept header, then you can return a normal ViewModel.
Following yesterday's article on returning JSON from a ZF2 controller action, Lukas suggested that I should also demonstrate how to use the Accept header to get JSON. So this is how you do it!
Set up the JsonStrategy
We set up the JsonStrategy as we did in returning JSON from a ZF2 controller action.
Return a ViewModel from the controller
As we're letting the JsonStrategy intercede for us, we don't need to do anything special in our controller at all. In this case, we simply return a normal ViewModel for use by either the JsonRenderer or PhpRenderer as required:
module/Application/src/Application/Controller/IndexController.php:
<?php
namespace Application\Controller;

use Zend\Mvc\Controller\ActionController,
    Zend\View\Model\ViewModel;

class IndexController extends ActionController
{
    public function anotherAction()
    {
        // Initialise the list explicitly before appending to it
        $matches = array();
        $matches[] = array('distance' => 10, 'playground' => array('a' => 1));
        $matches[] = array('distance' => 20, 'playground' => array('a' => 2));
        $matches[] = array('distance' => 30, 'playground' => array('a' => 3));

        $result = new ViewModel(array(
            'success' => true,
            'results' => $matches,
        ));

        return $result;
    }
}
with our HTML view script:
module/Application/view/index/another.phtml:
<?php if ($success): ?>
<h2>Results</h2>
<ul>
<?php foreach ($results as $row): ?>
    <li>Distance: <?php echo $this->escape($row['distance']);?>m</li>
<?php endforeach; ?>
</ul>
<?php endif; ?>
So if you set up a route and browse to it, you'll see a nicely rendered page.
Retrieving the data as JSON
To retrieve the data via JSON, we need a client where we can set the Accept header. We'll use curl for this test. When doing anything with APIs and testing, we head over to LornaJane's blog for the Curl Cheat Sheet and use this command line:
curl -H "Accept: application/json" http://zf2test.dev/json/another
and you should see the output of:
{
    "content":{
        "success":true,
        "results": [
            {"distance":10,"playground":{"a":1}},
            {"distance":20,"playground":{"a":2}},
            {"distance":30,"playground":{"a":3}}
        ]
    }
}
(Formatted for readability - you get the result back on a single line from curl.)
This way you can use the same controllers for your HTML views and for returning JSON to those clients that can use it.

















