As mentioned in an older post, I’m building a test and development environment using OpenStack. The environment is intended to be fairly integrated. Part of this integration is a consistent working environment between instances in a project. Providing home directories via NFS is the easiest way of ensuring this consistent working environment.
The problem with NFS home directories, however, is that they are fairly insecure. They can be used between instances to escalate privileges. In our environment, this isn’t a problem for instances within a project. If a user is a member of a project, they have shell on all instances. If they are given sudo access in a project, they are given sudo access on all instances. Sharing between projects, however, is a problem. If home directories are shared across projects, user-A with root on instance-A in project-A could su to user-B on instance-A, modify that user’s authorized_keys file, and then have access to project-B.
To avoid cross-project escalation, each project needs its own set of home directories. This means we can’t simply export /home to the instance’s private network and be done with it. We’ll need to create an exports file, and share different directory trees with specific instances. We’ll also need to mount home directories differently on the instances, depending on the project they belong to.
To do so, we’ll use a combination of puppet, LDAP, autofs, and Nova.
Creating the exports file
To create the exports file we need three things:
- A list of projects
- A list of instances within each project
- A list of home directory locations for each project
The first two can be found via LDAP. The query for a list of projects is: '(&(objectclass=groupofnames)(owner=*))'. The query for a list of instances within each project is: '(puppetvar=instanceproject=<project>)'. Of course, this approach is only usable for people using the OpenStackManager extension for MediaWiki; I’ll mention more portable ways to get this information later in the post.
The third we can extrapolate; we just need a base directory. I chose to use /export/home/<project> for the locations.
I wrote a python script that will pull this information, and create an exports file that looks like this:
/export/home/<project1> <project1-instance1>(rw,no_subtree_check) <project-instance2>(rw,no_subtree_check) <project_instance...>(rw,no_subtree_check)
/export/home/<project2> <project2-instance1>(rw,no_subtree_check) <project2-instance2>(rw,no_subtree_check) <project2_instance...>(rw,no_subtree_check)
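The script itself isn't reproduced here, but the rendering step can be sketched roughly as follows. This is a minimal sketch, not the actual script: it assumes the LDAP results have already been collected into a dictionary mapping project names to instance hostnames, and the function name `render_exports` and the example data are hypothetical.

```python
def render_exports(projects, base_dir="/export/home", options="rw,no_subtree_check"):
    """Return exports(5) text: one line per project, one client entry per instance."""
    lines = []
    for project in sorted(projects):
        instances = projects[project]
        if not instances:
            # Assumption: a project with no instances gets no export line at all,
            # rather than a path with an empty client list.
            continue
        clients = " ".join(f"{host}({options})" for host in sorted(instances))
        lines.append(f"{base_dir}/{project} {clients}")
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    # Hypothetical data standing in for the results of the two LDAP queries.
    example = {
        "testlabs": ["i-0000005c.pmtpa.wmflabs", "i-0000005d.pmtpa.wmflabs"],
        "otherproject": ["i-00000060.pmtpa.wmflabs"],
    }
    print(render_exports(example), end="")
```

Sorting the projects and hosts keeps the output stable between runs, which makes it easy to tell whether the file actually changed before re-exporting.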
Mounting the shares from the instances
Each instance needs to mount the share, depending on its project. There are a number of ways we can do this, but I like the flexibility of using autofs and LDAP to manage NFS mounts. To add slightly more flexibility we’ll involve the help of puppet as well.
In LDAP, we can create autofs entries by making maps and objects. The following objects add support for /home:
dn: nisMapName=auto.master,<basedn>
objectClass: top
objectClass: nisMap
nisMapName: auto.master

dn: nisMapName=auto.home,<basedn>
objectClass: top
objectClass: nisMap
nisMapName: auto.home

dn: nisMapName=/home,nisMapName=auto.master,<basedn>
objectClass: top
objectClass: nisObject
cn: /home
nisMapEntry: ldap:nisMapName=auto.home,<basedn>
nisMapName: auto.master
We also need to add entries for the specific home directories. Here we are going to invoke a little awesome magic that autofs has: variables. Here’s the entry we are using for all home directories in all projects:
dn: cn=*,nisMapName=auto.home,<basedn>
changetype: add
nisMapEntry: ${SERVNAME}:${HOMEDIRLOC}/&
objectClass: nisObject
objectClass: top
nisMapName: auto.home
cn: *
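To make the wildcard entry concrete, here is a rough Python model of the substitution autofs performs on lookup: `${NAME}` is replaced by the variable's value and `&` by the key being looked up (the username, matched by `*`). This only illustrates the expansion rule and is not how autofs is implemented; the function name and the example user are hypothetical.

```python
import re

# Matches either a ${NAME} variable reference or the '&' key placeholder.
_TOKEN = re.compile(r"\$\{(\w+)\}|&")


def expand_map_entry(entry, key, variables):
    """Expand ${NAME} variables and the '&' key placeholder in a map entry."""
    def substitute(match):
        name = match.group(1)
        # group(1) is None when the match was '&' rather than ${NAME}.
        return key if name is None else variables[name]

    return _TOKEN.sub(substitute, entry)


if __name__ == "__main__":
    # Values as they would be set for an instance in the testlabs project;
    # "alice" is a made-up username.
    variables = {"SERVNAME": "labs-nfs1", "HOMEDIRLOC": "/export/home/testlabs"}
    print(expand_map_entry("${SERVNAME}:${HOMEDIRLOC}/&", "alice", variables))
```

With those values, a lookup of /home/alice resolves to a mount of /export/home/testlabs/alice from labs-nfs1.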
We only need the one entry, which saves us from having to create and delete entries on creation and deletion of projects. Using this, however, means we need to set the variables. This is where puppet comes in. First, though, let’s look at the node in LDAP:
dn: dc=i-0000005c,dc=pmtpa,ou=hosts,dc=wikimedia,dc=org
objectClass: domainrelatedobject
objectClass: dnsdomain
objectClass: domain
objectClass: puppetclient
objectClass: dcobject
objectClass: top
puppetVar: realm=labs
puppetVar: writable=false
puppetVar: db_cluster=s1
puppetVar: instancecreator_email=rlane@wikimedia.org
puppetVar: instancecreator_username=Ryan Lane
puppetVar: instancecreator_lang=en
puppetVar: instanceproject=testlabs
puppetClass: base
puppetClass: ldap::client::wmf-test-cluster
puppetClass: exim::simple-mail-sender
puppetClass: db::core
puppetClass: mysql::mysqluser
puppetClass: mysql::datadirs
puppetClass: mysql::conf
l: pmtpa
associatedDomain: i-0000005c.pmtpa.wmflabs
associatedDomain: labs-db2.pmtpa.wmflabs
dc: i-0000005c
aRecord: 10.4.0.12
All the above objectclasses and attributes are available for use in puppet. The really important one here is instanceproject=testlabs.
We can set the autofs variables via the OPTIONS variable in the /etc/default/autofs file:
OPTIONS="-DSERVNAME=<%= nfs_server_name %> -DHOMEDIRLOC=<%= homedir_location %>"
Here the SERVNAME and HOMEDIRLOC autofs variables are set from nfs_server_name and homedir_location, which the puppet template fills in. Both are determined in a manifest:
$homedir_location = "/export/home/${instanceproject}"
nfs_server_name is chosen with a selector keyed on the project, which works like a lookup table:
$nfs_server_name = $instanceproject ? {
default => "labs-nfs1",
}
I chose a per-project lookup so that I can also separate the servers by project, if needed for performance or extra security.
Managing user home directories
Everything up to this point is just creating the shares. However, we must maintain the users’ home directories as well. For this, we need to know which users are in which projects, and we need to manage their home directories per project.
I wrote a script to search for users, based on a group (the project), that selectively creates/deletes/renames home directories and authorized_keys files. I should note here that I don’t use nova’s mechanisms for SSH key management, as it isn’t portable between applications. I instead store the keys in the user’s LDAP entry.
There’s a security issue with management of home directories. If a user is added to a project and we create a home directory with a populated authorized_keys file, then remove the user from the project but don’t remove their home directory, the user will still have access to the project’s instances. There are two ways I go about solving this issue:
- Ensure the user only has access to the instance if they are in the project, using access.conf. In my architecture, when projects are added, they are also made a posixgroup, with a gid. Thanks to this, we can treat the project as a system group in all instances. In access.conf we limit access to the project group.
- Users’ home directories are moved from /export/home/<project>/<user> to /export/home/<project>/SAVE/<user> when they are removed from the project.
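The second measure can be sketched as a small reconciliation step. This is a simplified illustration rather than the actual script: it takes the membership list as an argument instead of querying LDAP, skips ownership and authorized_keys handling, and the function name is hypothetical. Restoring a parked directory when a user rejoins is my assumption; the post only describes the move into SAVE.

```python
import shutil
from pathlib import Path


def sync_project_homes(project_dir, members):
    """Create missing home directories and park those of removed users under SAVE/."""
    project_dir = Path(project_dir)
    save_dir = project_dir / "SAVE"
    save_dir.mkdir(parents=True, exist_ok=True)

    for user in members:
        home = project_dir / user
        saved = save_dir / user
        if home.exists():
            continue
        if saved.exists():
            # Assumption: a returning member gets the parked directory back.
            shutil.move(str(saved), str(home))
        else:
            home.mkdir()

    # Snapshot the listing first, since directories are moved while looping.
    for home in list(project_dir.iterdir()):
        if home.name == "SAVE" or not home.is_dir():
            continue
        if home.name not in members:
            target = save_dir / home.name
            if target.exists():
                raise FileExistsError(f"{target} already exists; resolve by hand")
            shutil.move(str(home), str(target))
```

Because the parked directory is no longer at /export/home/<project>/<user>, its authorized_keys file is no longer where sshd looks for it on the instances.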
Problems with this solution, and future improvements to make
The major shortcoming of this solution is that it isn’t terribly portable. It is dependent on using LDAP, and storing specific information in the LDAP directory. Using the nova tools, or having nova manage the exports on instance creation/deletion would make this a much more portable solution.
Another shortcoming is that it isn’t terribly scalable. The exports file is being created from scratch every single script run (which needs to happen fairly frequently). Ideally, nova would write to a queue, and the NFS instance would add/remove instances from the exports as instances are created/deleted.
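As a rough idea of what the queue-driven variant could look like, the sketch below applies one create/delete event at a time to an in-memory project-to-instances mapping instead of rebuilding everything. The event format (a dictionary with `action`, `project` and `host` keys) is entirely hypothetical; it is not something nova emits in this form.

```python
def apply_event(projects, event):
    """Update the project -> set-of-hosts mapping in place from one event."""
    action, project, host = event["action"], event["project"], event["host"]
    hosts = projects.setdefault(project, set())
    if action == "create":
        hosts.add(host)
    elif action == "delete":
        hosts.discard(host)
    else:
        raise ValueError(f"unknown action: {action!r}")
    if not hosts:
        # Drop projects that no longer have any instances.
        del projects[project]


if __name__ == "__main__":
    state = {}
    apply_event(state, {"action": "create", "project": "testlabs", "host": "i-0000005c"})
    apply_event(state, {"action": "create", "project": "testlabs", "host": "i-0000005d"})
    apply_event(state, {"action": "delete", "project": "testlabs", "host": "i-0000005c"})
    print(state)
```

The exports file would then only need to be rewritten and reloaded when an event actually changes the mapping.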
Thankfully, I didn’t have a shortage of ideas about how to accomplish this, as shown in my proposal. I decided upon the quick and dirty approach, opting to do one of the more reusable approaches later. I’ll likely add support to nova for this at some time in the future.
For the proper automation of a service using puppet, it’s necessary to ensure the service can be installed repeatedly, and that the service is fully up and ready when it is built. To ensure this, I’m using the following process, using nova:
- Create an instance and use it to do experimentation with the service.
- Document the service, along with the installation process on wikitech, after ensuring the service is working properly.
- Create a second instance. Following the documentation written, puppetize the service.
- Create a third instance. Ensure the puppetized service runs properly when initialized from scratch.
- Kill all three instances, and replace the instances in the test cluster.
When a service changes in puppet, follow the above cycle as well.
Using this process, I can be assured the puppet manifests, as written, will allow me to repeatedly install this service.
At Stack Exchange our use case for virtualization is growing. We are not going to run our core Q&A web servers and database servers using virtualization for performance reasons, but we do host things such as our monitoring system, blogs, domain controllers, and VPN servers.
Our collection of assorted services continues to grow, and with it so does our need to expand our virtualization setup. Currently in our main data center we have 3 VMWare ESX servers. But as we expand, how are we going to handle this growth?
Why Use Virtualization?
Virtualization at its heart is an abstraction layer between the hardware and the operating system. I have always had mixed feelings about this because operating systems, in theory, are supposed to provide all the hardware abstraction and inter service protection you need. However, system administrators have to live in the real world, and this just isn’t the case.
This layer of abstraction, like any abstraction, has performance implications. This, in short, is why we are not using it for our core Q&A service. The advantages of this abstraction layer, however, are tantalizing:
- Live migration (vMotion in VMWare terms)
- Running multiple operating systems (e.g. Windows and Linux) on the same hardware
- Easier to get full utilization of hardware resources by moving VMs around
These advantages and others exist because of this abstraction layer. From a pure systems perspective, the allure of virtualization is to deliver us from many of the hardware constraints when we design systems and go about our day to day tasks. Operating systems become modular to the hardware, and with modularity comes flexibility and agility. Flexibility and agility come from the lifting of constraints and are perhaps some of the most desirable qualities in a system. However, does virtualization deliver on this promise of flexibility?
The Joy of Commodity Hardware
As Wikipedia defines it:
“Commodity computing (or Commodity cluster computing) is to use large numbers of already available computing components for parallel computing … commodity computing done with commodity computers as opposed to high-cost supermicrocomputers or boutique computers.”
Today the commodity computer is your standard x64 computer with some variation of one or a couple cores, SAS or SATA spinning disks or SSDs, and some memory. You can debate where to draw the line in this; for instance, some might call servers from Dell “specialized” servers whereas boxes built from parts at Newegg are not. However, I consider all this commodity hardware because they are essentially variations on the same design, basically better versions of your home computer. The opposite of this is specialized hardware. With specialized hardware, there are major differences between vendors and they generally run their own OS or a specialized variant of an operating system.
So what is the joy of commodity hardware? In my mind it is that it delivers on some of the same ideals that we want from virtualization: modularity and flexibility. When you design for commodity hardware your servers are essentially interchangeable parts. They can be reused for other things and easily upgraded or replaced with newer versions as computing evolves. It also generally scales in a linear fashion: when you need more power, you just add more boxes.
Specialized hardware, on the other hand, has the advantage of being better suited and optimized for its particular task. With this optimization, though, comes the cost of lost modularity. Probably the most common example of specialized hardware in many data centers is the SAN. SANs are the ultimate performers when it comes to storage, but you are not likely to swap out your SAN easily, and it can become a central constraint you design around.
Virtualization and Centralized Storage are Best Friends
With VMWare and many forms of virtualization, many of the features are designed to expect shared storage which generally comes in the form of a SAN. This relationship can be seen on the business side of things as well — EMC, one of the largest players in storage, is also the primary holder of VMWare.
Because the traditional virtualization infrastructure is designed around shared storage, the flexibility provided by virtualization comes in conflict with the flexibility of commodity hardware. That doesn’t mean shared storage can’t provide its own form of flexibility, but in my mind, these two are at odds with the traditional virtualization architecture. One of my main concerns is that over time the specialized hardware will weigh us down.
Virtualized Clusters to the Rescue?
If we can have the best of both worlds, it seems to me that it is going to come in the form of a virtual cluster. I first learned about these from a short presentation I saw by Tom Limoncelli about Ganeti. Ganeti is a console for managing virtual clusters built on top of Xen or KVM that is used at Google for some of their internal systems. The idea essentially is that you have a rack of commodity machines with many VMs per machine and still have the ability to do live migration. Using DRBD (think RAID 1 across multiple machines) allows for features like live migration without shared storage.
VMWare also offers an appliance called the VMWare vSphere Storage Appliance (VSA) which seems like it might also deliver some of the features you normally only get with a SAN without the SAN — but this doesn’t seem to be the traditional VMWare design.
Virtualized clusters seem like they will give us a lot of the flexibility we want from virtualization while also allowing us to stick with commodity hardware. Writes across network RAID will be slower because they need to be committed to the mirror, but not all VMs would need to have this enabled, and I don’t think performance is our primary concern when it comes to our use of virtualization.
What Will We Go With?
As when we tried to figure out what to do about storage, I don’t think this is a choice we can make overnight. Virtual clusters are very appealing to me, but we will need to take them for a spin and learn what the limitations are. Centralized storage doesn’t sit well with the ideals and promises of commodity computing, but as I said before, system administrators need to operate in the real world with real constraints, so a SAN might be the best solution for us.
A couple of weeks ago a friend of mine asked me how to use MySQL stored procedures with PHP’s mysqli API. Out of curiosity I asked another friend, a team lead, how things were going with their PHP MySQL project, for which they had planned to have most of their business logic in stored procedures. I got an email in reply stating something along the lines of: "Our developers found that mysqli does not support stored procedures correctly. We use PDO." Well, the existing documentation from PHP 5.0 times is not stellar, I confess. But still, that’s a bit too much… it ain’t that difficult. And, it works.
Using stored procedures with mysqli
The MySQL database supports stored procedures. A stored procedure is a subroutine stored in the database catalog. Applications can call and execute the stored procedure. The CALL SQL statement is used to execute a stored procedure.
Parameters
Stored procedures can have IN, INOUT and OUT parameters. The mysqli interface has no special notion for the different kinds of parameters.
IN parameter
Input parameters are provided with the CALL statement. Please, make sure values are escaped correctly.
$mysqli = new mysqli("localhost", "root", "", "test");
if (!$mysqli->query("DROP TABLE IF EXISTS test") ||
!$mysqli->query("CREATE TABLE test(id INT)"))
echo "Table creation failed: (" . $mysqli->errno . ") " . $mysqli->error;
if (!$mysqli->query("DROP PROCEDURE IF EXISTS p") ||
!$mysqli->query("CREATE PROCEDURE p(IN id_val INT) BEGIN INSERT INTO test(id) VALUES(id_val); END;"))
echo "Stored procedure creation failed: (" . $mysqli->errno . ") " . $mysqli->error;
if (!$mysqli->query("CALL p(1)"))
echo "CALL failed: (" . $mysqli->errno . ") " . $mysqli->error;
if (!($res = $mysqli->query("SELECT id FROM test")))
echo "SELECT failed: (" . $mysqli->errno . ") " . $mysqli->error;
var_dump($res->fetch_assoc());
array(1) {
["id"]=>
string(1) "1"
}
INOUT/OUT parameter
The values of INOUT/OUT parameters are accessed using session variables.
$mysqli = new mysqli("localhost", "root", "", "test");
if (!$mysqli->query("DROP PROCEDURE IF EXISTS p") ||
!$mysqli->query('CREATE PROCEDURE p(OUT msg VARCHAR(50)) BEGIN SELECT "Hi!" INTO msg; END;'))
echo "Stored procedure creation failed: (" . $mysqli->errno . ") " . $mysqli->error;
if (!$mysqli->query("SET @msg = ''") ||
!$mysqli->query("CALL p(@msg)"))
echo "CALL failed: (" . $mysqli->errno . ") " . $mysqli->error;
if (!($res = $mysqli->query("SELECT @msg as _p_out")))
echo "Fetch failed: (" . $mysqli->errno . ") " . $mysqli->error;
$row = $res->fetch_assoc();
echo $row['_p_out'];
Hi!
Application and framework developers may be able to provide a more convenient API using a mix of session variables and database catalog inspection. However, please note the possible performance impact of a custom solution based on catalog inspection.
Handling result sets
Stored procedures can return result sets. Result sets returned from a stored procedure cannot be fetched correctly using mysqli_query(). The mysqli_query() function combines statement execution and fetching the first result set into a buffered result set, if any. However, there are additional stored procedure result sets hidden from the user, which cause mysqli_query() to fail to return the result sets the user expects.
Result sets returned from a stored procedure are fetched using mysqli_real_query() or mysqli_multi_query(). Both functions allow fetching any number of result sets returned by a statement, such as CALL. Failing to fetch all result sets returned by a stored procedure causes an error.
$mysqli = new mysqli("localhost", "root", "", "test");
if (!$mysqli->query("DROP TABLE IF EXISTS test") ||
!$mysqli->query("CREATE TABLE test(id INT)") ||
!$mysqli->query("INSERT INTO test(id) VALUES (1), (2), (3)"))
echo "Table creation failed: (" . $mysqli->errno . ") " . $mysqli->error;
if (!$mysqli->query("DROP PROCEDURE IF EXISTS p") ||
!$mysqli->query('CREATE PROCEDURE p() READS SQL DATA BEGIN SELECT id FROM test; SELECT id + 1 FROM test; END;'))
echo "Stored procedure creation failed: (" . $mysqli->errno . ") " . $mysqli->error;
if (!$mysqli->multi_query("CALL p()"))
echo "CALL failed: (" . $mysqli->errno . ") " . $mysqli->error;
do {
if ($res = $mysqli->store_result()) {
printf("---\n");
var_dump($res->fetch_all());
$res->free();
} else {
if ($mysqli->errno)
echo "Store failed: (" . $mysqli->errno . ") " . $mysqli->error;
}
} while ($mysqli->more_results() && $mysqli->next_result());
---
array(3) {
[0]=>
array(1) {
[0]=>
string(1) "1"
}
[1]=>
array(1) {
[0]=>
string(1) "2"
}
[2]=>
array(1) {
[0]=>
string(1) "3"
}
}
---
array(3) {
[0]=>
array(1) {
[0]=>
string(1) "2"
}
[1]=>
array(1) {
[0]=>
string(1) "3"
}
[2]=>
array(1) {
[0]=>
string(1) "4"
}
}
Use of prepared statements
No special handling is required when using the prepared statement interface for fetching results from the same stored procedure as above. The prepared statement and non-prepared statement interfaces are similar. Please note that not every MySQL server version may support preparing the CALL SQL statement.
$mysqli = new mysqli("localhost", "root", "", "test");
if (!$mysqli->query("DROP TABLE IF EXISTS test") ||
!$mysqli->query("CREATE TABLE test(id INT)") ||
!$mysqli->query("INSERT INTO test(id) VALUES (1), (2), (3)"))
echo "Table creation failed: (" . $mysqli->errno . ") " . $mysqli->error;
if (!$mysqli->query("DROP PROCEDURE IF EXISTS p") ||
!$mysqli->query('CREATE PROCEDURE p() READS SQL DATA BEGIN SELECT id FROM test; SELECT id + 1 FROM test; END;'))
echo "Stored procedure creation failed: (" . $mysqli->errno . ") " . $mysqli->error;
if (!($stmt = $mysqli->prepare("CALL p()")))
echo "Prepare failed: (" . $mysqli->errno . ") " . $mysqli->error;
if (!$stmt->execute())
echo "Execute failed: (" . $stmt->errno . ") " . $stmt->error;
do {
if ($res = $stmt->get_result()) {
printf("---\n");
var_dump(mysqli_fetch_all($res));
mysqli_free_result($res);
} else {
if ($stmt->errno)
echo "Store failed: (" . $stmt->errno . ") " . $stmt->error;
}
} while ($stmt->more_results() && $stmt->next_result());
Of course, use of the bind API for fetching is supported as well.
if (!($stmt = $mysqli->prepare("CALL p()")))
echo "Prepare failed: (" . $mysqli->errno . ") " . $mysqli->error;
if (!$stmt->execute())
echo "Execute failed: (" . $stmt->errno . ") " . $stmt->error;
do {
$id_out = NULL;
if (!$stmt->bind_result($id_out))
echo "Bind failed: (" . $stmt->errno . ") " . $stmt->error;
while ($stmt->fetch())
echo "id = $id_out\n";
} while ($stmt->more_results() && $stmt->next_result());
Happy hacking!
Today the number of URLs analyzed was doubled in both the HTTP Archive (from 17K to 34K URLs) and in the HTTP Archive Mobile (from 1K to 2K URLs).
This is a small step toward our goal of 1 million URLs, but it validates numerous code changes that landed recently:
- #22: update URL lists – Previously the list of URLs to crawl was manually created (by me) from multiple other lists (Alexa, Quantcast, Fortune 500, etc.). Because it was manually created it wasn’t updated frequently. Now the list is based on the Alexa Top 1,000,000 Sites and is updated every crawl.
- #243: handle non-ranked URLs – Some of the URLs crawled up until now are NOT in the Alexa Top 1M. In order to support looking at long term trends (by selecting “intersection”) I wanted to continue crawling these outliers. So the list of URLs that is crawled supports crawling non-ranked websites. This will allow many other nice features that you’ll hear about next week.
- #242: rewrite batch_process.php – There’s a bunch of code for doing the crawl that needed to be made more efficient as we increase two orders of magnitude.
- #68: cache aggregate stats for trends.php – Again, in order to deal with a larger number of URLs and still generate charts quickly, I introduced a caching layer for the aggregate stats.
- #196: Publish a mysql schema dump – Exploring the data is now easier. Instead of having to set up an entire instance of the code, you simply create the tables based on the schema dump and download data that is of interest.
With these and other changes behind us, we’ll continue to increase the number of URLs to reach our goal. There are still some big tasks to tackle including changing the DB schema, increasing the capacity on mobile with more devices or switching to an emulator, and combining these two sites into a single site for easier comparison of desktop & mobile data.
No blog post about HTTP Archive would be complete without some observations. As mentioned earlier, whenever looking at long term trends I choose the intersection – which means the exact same URLs are included in every data point.
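For readers unfamiliar with the term, the intersection can be illustrated with a few lines of Python. This is only an illustration of the idea, not the HTTP Archive's code; the function name and the example URLs are made up.

```python
def intersection_urls(crawls):
    """Return the URLs that appear in every crawl."""
    url_sets = [set(urls) for urls in crawls]
    if not url_sets:
        return set()
    return set.intersection(*url_sets)


if __name__ == "__main__":
    # Hypothetical crawls: only URLs present in all of them are kept.
    crawls = [
        ["http://a.example/", "http://b.example/", "http://c.example/"],
        ["http://a.example/", "http://c.example/", "http://d.example/"],
        ["http://c.example/", "http://a.example/"],
    ]
    print(sorted(intersection_urls(crawls)))
```

Restricting a trend chart to this set means a change between two data points reflects a change in the pages themselves and not in which sites were crawled.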
The main trend I’ve been noticing is how the size of resources is growing much faster than the number of resources. This growth is most evident in scripts and images. It’s no surprise – the Web is getting bigger. But now we can see where that’s happening and explore solutions.
I also wanted to shout out to Pat Meenan and Guy (“Guypo”) Podjarny. Pat works at Google and is the creator of WebPagetest, which is the foundation for the HTTP Archive (Mobile). Guypo works at Blaze and provides additional infrastructure and devices for all the mobile testing. In addition, there are a growing number of contributors to the open source project. And none of this would be happening without support from our sponsors: Google, Mozilla, New Relic, O’Reilly Media, Etsy, Strangeloop, and dynaTrace Software.
Watch for a fun announcement next week.
Starting with PHP mysqli is easy, if one has some SQL and PHP skills. To get started one needs to know about the specifics of MySQL and a few code snippets. Using MySQL stored procedures with PHP mysqli has found enough readers to begin with a “quickstart” or “how-to” series. Take this post with a grain of salt. I have nothing against Prepared Statements as such, but I dislike unreflected blind use.
Using prepared statements with mysqli
The MySQL database supports prepared statements. A prepared statement or a parameterized statement is used to execute the same statement repeatedly with high efficiency.
Basic workflow
The prepared statement execution consists of two stages: prepare and execute. At the prepare stage a statement template is sent to the database server. The server performs a syntax check and initializes server internal resources for later use.
The MySQL server supports anonymous, positional placeholders written as ?.
$mysqli = new mysqli("localhost", "root", "", "test");
/* Non-prepared statement */
if (!$mysqli->query("DROP TABLE IF EXISTS test") ||
    !$mysqli->query("CREATE TABLE test(id INT)"))
    echo "Table creation failed: (" . $mysqli->errno . ") " . $mysqli->error;
/* Prepared statement, stage 1: prepare */
if (!($stmt = $mysqli->prepare("INSERT INTO test(id) VALUES (?)")))
    echo "Prepare failed: (" . $mysqli->errno . ") " . $mysqli->error;
Prepare is followed by execute. During execute the client binds parameter values and sends them to the server. The server creates a statement from the statement template and the bound values to execute it using the previously created internal resources.
/* Prepared statement, stage 2: bind and execute */
$id = 1;
if (!$stmt->bind_param("i", $id))
    echo "Binding parameters failed: (" . $stmt->errno . ") " . $stmt->error;
if (!$stmt->execute())
    echo "Execute failed: (" . $stmt->errno . ") " . $stmt->error;
Repeated execution
A prepared statement can be executed repeatedly. Upon every execution the current value of the bound variable is evaluated and sent to the server. The statement is not parsed again. The statement template is not transferred to the server again.
$mysqli = new mysqli("localhost", "root", "", "test");
/* Non-prepared statement */
if (!$mysqli->query("DROP TABLE IF EXISTS test") ||
    !$mysqli->query("CREATE TABLE test(id INT)"))
    echo "Table creation failed: (" . $mysqli->errno . ") " . $mysqli->error;
/* Prepared statement, stage 1: prepare */
if (!($stmt = $mysqli->prepare("INSERT INTO test(id) VALUES (?)")))
    echo "Prepare failed: (" . $mysqli->errno . ") " . $mysqli->error;
/* Prepared statement, stage 2: bind and execute */
$id = 1;
if (!$stmt->bind_param("i", $id))
    echo "Binding parameters failed: (" . $stmt->errno . ") " . $stmt->error;
if (!$stmt->execute())
    echo "Execute failed: (" . $stmt->errno . ") " . $stmt->error;
/* Prepared statement: repeated execution, only data transferred from client to server */
for ($id = 2; $id < 5; $id++)
    if (!$stmt->execute())
        echo "Execute failed: (" . $stmt->errno . ") " . $stmt->error;
/* explicit close recommended */
$stmt->close();
/* Non-prepared statement */
$res = $mysqli->query("SELECT id FROM test");
var_dump($res->fetch_all());
array(4) {
[0]=>
array(1) {
[0]=>
string(1) "1"
}
[1]=>
array(1) {
[0]=>
string(1) "2"
}
[2]=>
array(1) {
[0]=>
string(1) "3"
}
[3]=>
array(1) {
[0]=>
string(1) "4"
}
}
Every prepared statement occupies server resources. Statements should be closed explicitly immediately after use. If not done explicitly, the statement will be closed when the statement handle is freed by PHP.
Using a prepared statement is not always the most efficient way of executing a statement. A prepared statement executed only once causes more client-server round-trips than a non-prepared statement. This is why the SELECT is not run as a prepared statement above.
Also, consider the use of the MySQL multi-INSERT SQL syntax for INSERTs. For example, the multi-INSERT below requires fewer round trips between the server and client than the prepared statement shown above.
if (!$mysqli->query("INSERT INTO test(id) VALUES (1), (2), (3), (4)"))
    echo "Multi-INSERT failed: (" . $mysqli->errno . ") " . $mysqli->error;
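To make the round-trip argument concrete, here is a minimal sketch that counts client-server round trips for inserting a number of rows with each approach. The formulas follow the reasoning in this post: one prepare plus one execute per row for a prepared statement, one query per row for non-prepared single-row INSERTs, and a single query for a multi-INSERT. The function name is made up for illustration, closing the statement is not counted, and nothing here talks to a server.

```php
<?php
/**
 * Rough model of the client-server round trips needed to insert $rows rows.
 * Hypothetical helper for illustration only; it does not contact a server.
 */
function insert_round_trips(string $approach, int $rows): int
{
    if ($rows < 1) {
        throw new InvalidArgumentException("rows must be >= 1");
    }
    switch ($approach) {
        case 'prepared':      /* 1 prepare + one execute per row */
            return 1 + $rows;
        case 'non-prepared':  /* one single-row INSERT per row */
            return $rows;
        case 'multi-insert':  /* one INSERT ... VALUES (..), (..), ... */
            return 1;
        default:
            throw new InvalidArgumentException("unknown approach: $approach");
    }
}

foreach (array('prepared', 'non-prepared', 'multi-insert') as $approach) {
    printf("%-12s %d\n", $approach, insert_round_trips($approach, 4));
}
```

For the four rows used in the examples above, the model gives 5, 4 and 1 round trips respectively.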
Result set values data types
The MySQL Client Server Protocol defines different data transfer protocols for prepared statements and non-prepared statements. Prepared statements use the so-called binary protocol. The MySQL server sends result set data "as is" in binary format. Results are not serialized into strings before sending. The client libraries therefore do not receive only strings. Instead, they receive binary data and try to convert the values into appropriate PHP data types. For example, results from an SQL INT column will be provided as PHP integer variables.
$mysqli = new mysqli("localhost", "root", "", "test");
if (!$mysqli->query("DROP TABLE IF EXISTS test") ||
    !$mysqli->query("CREATE TABLE test(id INT, label CHAR(1))") ||
    !$mysqli->query("INSERT INTO test(id, label) VALUES (1, 'a')"))
    echo "Table creation failed: (" . $mysqli->errno . ") " . $mysqli->error;
$stmt = $mysqli->prepare("SELECT id, label FROM test WHERE id = 1");
$stmt->execute();
$res = $stmt->get_result();
$row = $res->fetch_assoc();
printf("id = %s (%s)\n", $row['id'], gettype($row['id']));
printf("label = %s (%s)\n", $row['label'], gettype($row['label']));
id = 1 (integer)
label = a (string)
This behaviour differs from non-prepared statements. By default, non-prepared statements return all results as strings. This default can be changed using a connection option (hint: more blog posts coming…). If the connection option is used, there are no differences.
Fetching results using bound variables
Results from prepared statements can either be retrieved by binding output variables or by requesting a mysqli_result object.
Output variables must be bound after statement execution. One variable must be bound for every column of the statement’s result set.
$mysqli = new mysqli("localhost", "root", "", "test");
if (!$mysqli->query("DROP TABLE IF EXISTS test") ||
    !$mysqli->query("CREATE TABLE test(id INT, label CHAR(1))") ||
    !$mysqli->query("INSERT INTO test(id, label) VALUES (1, 'a')"))
    echo "Table creation failed: (" . $mysqli->errno . ") " . $mysqli->error;
if (!($stmt = $mysqli->prepare("SELECT id, label FROM test")))
    echo "Prepare failed: (" . $mysqli->errno . ") " . $mysqli->error;
if (!$stmt->execute())
    echo "Execute failed: (" . $stmt->errno . ") " . $stmt->error;
$out_id = NULL;
$out_label = NULL;
if (!$stmt->bind_result($out_id, $out_label))
    echo "Binding output parameters failed: (" . $stmt->errno . ") " . $stmt->error;
while ($stmt->fetch())
    printf("id = %s (%s), label = %s (%s)\n",
        $out_id, gettype($out_id),
        $out_label, gettype($out_label));
id = 1 (integer), label = a (string)
Prepared statements return unbuffered result sets by default. The results of the statement are not implicitly fetched and transferred from the server to the client for client-side buffering. The result set takes server resources until all results have been fetched by the client. It is therefore recommended to consume results in a timely manner. If a client fails to fetch all results or the client closes the statement before having fetched all data, the data has to be fetched implicitly by mysqli.
It is possible to buffer the results of a prepared statement using mysqli_stmt_store_result().
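As a minimal sketch of that buffering step, the function below calls store_result() between execute() and the fetch loop, so the complete result set is transferred to the client before any row is read through the bound variables. The function name is made up for illustration, and it assumes an open connection and the test(id, label) table from the example above.

```php
<?php
/**
 * Fetch all rows of test(id, label) through bound variables, buffering the
 * result set on the client first. Hypothetical helper; assumes the table
 * created in the example above exists.
 */
function fetch_all_buffered(mysqli $mysqli): array
{
    if (!($stmt = $mysqli->prepare("SELECT id, label FROM test ORDER BY id ASC"))) {
        throw new RuntimeException("Prepare failed: (" . $mysqli->errno . ") " . $mysqli->error);
    }
    if (!$stmt->execute()) {
        throw new RuntimeException("Execute failed: (" . $stmt->errno . ") " . $stmt->error);
    }
    /* Transfer the complete result set to the client */
    if (!$stmt->store_result()) {
        throw new RuntimeException("Buffering failed: (" . $stmt->errno . ") " . $stmt->error);
    }

    $out_id = NULL;
    $out_label = NULL;
    if (!$stmt->bind_result($out_id, $out_label)) {
        throw new RuntimeException("Binding output parameters failed: (" . $stmt->errno . ") " . $stmt->error);
    }

    $rows = array();
    while ($stmt->fetch()) {
        /* Copy the current values; the bound variables are overwritten per row */
        $rows[] = array('id' => $out_id, 'label' => $out_label);
    }

    /* Release the client-side buffer and the server-side statement */
    $stmt->free_result();
    $stmt->close();

    return $rows;
}
```

A call such as `fetch_all_buffered($mysqli)` would then return one array per row. Recent PHP versions report mysqli errors as exceptions by default, in which case the explicit return value checks simply never trigger.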
Fetching results using mysqli_result interface
Instead of using bound results, results can also be retrieved through the mysqli_result interface. mysqli_stmt_get_result() returns a buffered result set.
$mysqli = new mysqli("localhost", "root", "", "test");
if (!$mysqli->query("DROP TABLE IF EXISTS test") ||
    !$mysqli->query("CREATE TABLE test(id INT, label CHAR(1))") ||
    !$mysqli->query("INSERT INTO test(id, label) VALUES (1, 'a')"))
    echo "Table creation failed: (" . $mysqli->errno . ") " . $mysqli->error;
if (!($stmt = $mysqli->prepare("SELECT id, label FROM test ORDER BY id ASC")))
    echo "Prepare failed: (" . $mysqli->errno . ") " . $mysqli->error;
if (!$stmt->execute())
    echo "Execute failed: (" . $stmt->errno . ") " . $stmt->error;
if (!($res = $stmt->get_result()))
    echo "Getting result set failed: (" . $stmt->errno . ") " . $stmt->error;
var_dump($res->fetch_all());
array(1) {
[0]=>
array(2) {
[0]=>
int(1)
[1]=>
string(1) "a"
}
}
Using the mysqli_result interface has the additional benefit of flexible client-side result set navigation.
$mysqli = new mysqli("localhost", "root", "", "test");
if (!$mysqli->query("DROP TABLE IF EXISTS test") ||
    !$mysqli->query("CREATE TABLE test(id INT, label CHAR(1))") ||
    !$mysqli->query("INSERT INTO test(id, label) VALUES (1, 'a'), (2, 'b'), (3, 'c')"))
    echo "Table creation failed: (" . $mysqli->errno . ") " . $mysqli->error;
if (!($stmt = $mysqli->prepare("SELECT id, label FROM test")))
    echo "Prepare failed: (" . $mysqli->errno . ") " . $mysqli->error;
if (!$stmt->execute())
    echo "Execute failed: (" . $stmt->errno . ") " . $stmt->error;
if (!($res = $stmt->get_result()))
    echo "Getting result set failed: (" . $stmt->errno . ") " . $stmt->error;
for ($row_no = ($res->num_rows - 1); $row_no >= 0; $row_no--) {
    $res->data_seek($row_no);
    var_dump($res->fetch_assoc());
}
$res->close();
array(2) {
["id"]=>
int(3)
["label"]=>
string(1) "c"
}
array(2) {
["id"]=>
int(2)
["label"]=>
string(1) "b"
}
array(2) {
["id"]=>
int(1)
["label"]=>
string(1) "a"
}
Escaping and SQL injection
Bound variables will be escaped automatically by the server. The server inserts their escaped values at the appropriate places into the statement template before execution. Users must hint the server about the type of the bound variable for appropriate conversion, see mysqli_stmt_bind_param().
The automatic escaping of values within the server is sometimes considered as a security feature to prevent SQL injection. The same degree of security can be achieved with non-prepared statements, if input values are escaped correctly.
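For comparison, here is a minimal sketch of the manual escaping route with a non-prepared statement. The helper name is made up for illustration, and it assumes an open connection and the test(id, label) table used throughout this post.

```php
<?php
/**
 * Build a SELECT statement with a manually escaped string value.
 * Hypothetical helper; assumes the test(id, label) table from this post.
 */
function build_label_query(mysqli $mysqli, string $label): string
{
    /* real_escape_string() takes the connection's character set into account.
       The escaped value must still be placed inside quotes. */
    return "SELECT id, label FROM test WHERE label = '"
        . $mysqli->real_escape_string($label) . "'";
}
```

The resulting string would then be passed to `$mysqli->query()`. Escaping only protects values that end up inside a quoted string literal; a numeric value inserted without quotes needs a cast such as `(int)` instead.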
Client-side prepared statement emulation
The API does not include a client-side prepared statement emulation.
Quick prepared - non-prepared statement comparison
The table below gives a quick comparison of server-side prepared and non-prepared statements.
| | Prepared statement | Non-prepared statement |
|---|---|---|
| Client-server round trips, SELECT, single execution | 2 | 1 |
| Statement string transferred from client to server | 1 | 1 |
| Client-server round trips, SELECT, repeated (n) execution | 1 + n | n |
| Statement string transferred from client to server, repeated (n) execution | 1 template, n times bound parameter, if any | n times together with parameter, if any |
| Input parameter binding API | yes, automatic input escaping | no, manual input escaping |
| Output variable binding API | yes | no |
| Supports use of mysqli_result API | yes, use mysqli_stmt_get_result() | yes |
| Buffered result sets | yes, use mysqli_stmt_get_result() or binding with mysqli_stmt_store_result() | yes, default of mysqli_query() |
| Unbuffered result sets | yes, use output binding API | yes, use mysqli_real_query() with mysqli_use_result() |
| MySQL Client Server protocol data transfer flavour | binary | text |
| Result set values SQL data types | preserved when fetching | converted to string or preserved when fetching |
| Supports all SQL statements | Recent MySQL versions support most but not all | yes |
PECL/mysqlnd_ms 1.1.2-stable has been released. The mysqlnd replication and load balancing plugin for PHP 5.3/5.4 finally got the download label it deserves: stable, ready for production use! PECL/mysqlnd_ms makes using any kind of MySQL database cluster easier.
- Download PECL/mysqlnd_ms from pecl.php.net
- Documentation at the PHP Reference Manual
Key features
The release motto of the 1.1 series is “cover MySQL Replication basics with production quality”, which shows that the plugin is optimized for supporting MySQL replication clusters. Its feature set is not limited to replication, however: MySQL Cluster users will also profit from it.
- Automatic read/write splitting
  - can be controlled with SQL hints
  - can be replaced providing callback
  - can be disabled for MySQL Cluster use
- Load Balancing
  - random (pick for every statement or once per request, latter is default)
  - round robin (iterate per statement)
  - can be replaced providing callback
  - can be controlled with SQL hint
- Fail over
  - optional, automatic connect fail over
- Connection pooling
  - Lazy connections (don’t open before use, default)
The plugin can be used with any PHP MySQL API/extension (mysql, mysqli, PDO_MySQL), if the extension is compiled to use the mysqlnd library. Whatever framework, whatever API you use, it should work out of the box. As a library plugin, it operates on its own layer below your application. Few or no application-level changes are required.
PECL/mysqlnd_ms 1.1.1-beta in production use at ihigh.com
Nicholas Solon from ihigh.com, a US high school sports site, contacted us a couple of months ago. We were very pleased about this. Real-life feedback, whether feature requests or bug reports, is most welcome. Below is an excerpt from his last mail…
We are finally running 1.1.1-beta from the latest tarball on PHP 5.3.8 with MySQL 5.5.15 on 1 master, 2 slaves (FreeBSD) and using exclusively InnoDB. It’s a production environment, so we’ve been very slow to get this set up, but I’m very pleased with the performance! In this setup, we get about 1.5 million monthly uniques according to Google Analytics. We broadcast live high school sporting events around the US and other parts of the world, so Friday nights are especially load-intense.
(Nicholas Solon, developer at ihigh.com)
From 1.0 to 1.1
The 1.1 version has been significantly re-factored and extended. Many pitfalls on connection state changes have been removed. Connection state changes can happen when switching from one cluster node to another, either for load balancing or for read-write splitting. If you are new to developing software for MySQL replication clusters, please check the concepts section of the manual.
The plugin’s configuration format is now JSON-based. This was done to prepare for hierarchical and nested configurations. A new filter concept has been introduced. Filters work like small Unix utilities which can be stacked. The manual, which has been extended significantly, explains both in great depth. If you prefer blog posts, check out Replication plugin | filter | conquer = 1.1.0 coming.
What’s next?
Tell us! With the 1.1.0 series we have laid the necessary foundations in the code base. From here, we can drive in many directions. We can start to look into Global Transaction IDs, coming to the server soon, or look into replication table filter rule support, or refine load balancing rules, or….
A minor, though time-intensive thing we are planning is updating the PHP MySQL documentation.
Happy hacking!
The series Using X with PHP mysqli continues. After notes on calling stored procedures and using prepared statements, it’s time for a multiple statement quickstart. A mighty tool, if used with care…
Using Multiple Statements with mysqli
MySQL optionally allows having multiple statements in one statement string. Sending multiple statements at once reduces client-server round trips but requires special handling.
Multiple statements or multi queries must be executed with mysqli_multi_query(). The individual statements of the statement string are separated by semicolons. Then, all result sets returned by the executed statements must be fetched.
The MySQL server allows having statements that do return result sets and statements that do not return result sets in one multiple statement.
$mysqli = new mysqli("localhost", "root", "", "test");
if (!$mysqli->query("DROP TABLE IF EXISTS test") ||
    !$mysqli->query("CREATE TABLE test(id INT)"))
    echo "Table creation failed: (" . $mysqli->errno . ") " . $mysqli->error;
$sql = "SELECT COUNT(*) AS _num FROM test; ";
$sql.= "INSERT INTO test(id) VALUES (1); ";
$sql.= "SELECT COUNT(*) AS _num FROM test; ";
if (!$mysqli->multi_query($sql))
    echo "Multi query failed: (" . $mysqli->errno . ") " . $mysqli->error;
do {
    if ($res = $mysqli->store_result()) {
        var_dump($res->fetch_all(MYSQLI_ASSOC));
        $res->free();
    }
} while ($mysqli->more_results() && $mysqli->next_result());
array(1) {
[0]=>
array(1) {
["_num"]=>
string(1) "0"
}
}
array(1) {
[0]=>
array(1) {
["_num"]=>
string(1) "1"
}
}
Security considerations
The API functions mysqli_query() and mysqli_real_query() do not set a connection flag for activating multi queries in the server. An extra API call is used for multiple statements to reduce the likelihood of accidental SQL injection attacks. An attacker may try to add statements such as ; DROP DATABASE mysql or ; SELECT SLEEP(999). If the attacker succeeds in adding SQL to the statement string but mysqli_multi_query() is not used, the server will not execute the second, injected, malicious SQL statement.
Prepared statements
Using multiple statements with prepared statements is not supported.
Happy hacking!
Just over a year and a half ago I broke onto the scene with some demos of running YUI on the server with Node.js. This started out as an exercise in just stressing YUI’s modularity and its ability to be used in more places than just the browser.
Back in April of 2010 I started this journey with this blog post followed by one of my most popular ones, Server Side DOM. After this series of articles, I began talking about YUI on Node.js to anyone that would listen. In Sept of 2010 I started with a small town hall style video, then again at YUIConf 2010 in November.
By that time, others had started seeing the possibilities that I had and started doing more awesome things. Matthew Taylor gave a talk after mine at YUIConf 2010 called YUI 3 & Node.js for JavaScript View Rendering on Client or Server. Many people didn’t realize at that time that Matt was working on an internal project at Yahoo! called Mojito, but he was spending some serious time tweaking YUI on Node.js and building some outstanding things. Satyen Desai also gave a talk at YUIConf 2010 called A Phone, a Tablet and a Laptop Walk into a Bar.
In May of 2011 I gave another presentation at our internal F2E Summit called YUI 3 and Node.js – Not Just for Web Pages where I highlighted using YUI on the server/commandline for utilities or services that do not involve web pages.
All of these talks led up to YUIConf 2011, where there were several talks on Node.js, YUI, and mobile development. It has often been said that YUI is not designed for mobile development. Well, we’re here to prove you wrong. One of my favorite quotes heard at YUIConf this year was something like, “YUI doesn’t need to branch and create a mobile code line, their shit just works”. This is something that I fully stand behind. YUI’s flexible module system makes it perfect for building mobile applications. From our core modules, to our conditional loading, to our new App Framework, we’re already powering high-profile mobile web applications today.
This brings me to the launch of Livestand. After launching, Livestand hit the #1 position in both Free Apps and News Apps in the Apple App Store. Not only is this app beautiful to look at, it’s a technical revolution here at Yahoo!. Livestand was built using new technologies that, if you haven’t guessed yet, are all based on YUI on the server and on the client. It brings together all the things I and others have been talking about for almost 2 years now, and delivers them in one fantastic application. In the coming months, the core technology powering Livestand will be released so you can start building kick-ass applications like it.
I’m proud of what the Livestand team has accomplished and I take great pride in knowing that YUI was there to push the boundaries and help them reach their goals.
I’ll leave you with one more quote overheard at YUIConf 2011: “YUI is kind of like a Transformer: more than meets the eye”.
We’re still editing the more than 25 hours of video recorded at YUIConf, but in the meantime here’s a preview of what will be coming to YUI Theater over the next few weeks.
Opening a database connection is a boring task. But do you know how defaults are determined if values are omitted? Or did you know there are two flavours of persistent connections in mysqli? Of course you, as a German reader, know it. I blogged about it in 2009 over at phphatesme.com (Nimmer Ärger mit den Persistenten Verbindungen von MySQL? ) …
Database connections with mysqli
The MySQL server supports the use of different transport layers for connections. Connections use TCP/IP, Unix domain sockets or Windows named pipes.
The hostname localhost has a special meaning. It is bound to the use of Unix domain sockets. It is not possible to open a TCP/IP connection using the hostname localhost; you must use 127.0.0.1 instead.
$mysqli = new mysqli("localhost", "root", "", "test");
echo $mysqli->host_info . "\n";
$mysqli = new mysqli("127.0.0.1", "root", "", "test", 3306);
echo $mysqli->host_info . "\n";
Localhost via UNIX socket
127.0.0.1 via TCP/IP
Connection parameter defaults
Depending on the connection function used, assorted parameters can be omitted. If a parameter is not given, the extension attempts to use default values set in the PHP configuration file.
mysqli.default_host=192.168.2.27
mysqli.default_user=root
mysqli.default_pw=""
mysqli.default_port=3306
mysqli.default_socket=/tmp/mysql.sock
The resulting parameter values are then passed to the client library used by the extension. If the client library detects empty or unset parameters, it may default to library built-in values.
Built-in connection library defaults
If the host value is unset or empty, the client library will default to a Unix socket connection on localhost. If socket is unset or empty and a Unix socket connection is requested, a connection to the default socket on /tmp/mysql.sock is attempted.
On Windows systems the host name . is interpreted by the client library as an attempt to open a Windows named pipe based connection. In this case the socket parameter is interpreted as the pipe name. If not given or empty, the socket (here: pipe name) defaults to \\.\pipe\MySQL.
If neither a Unix domain socket based nor a Windows named pipe based connection is to be established and the port parameter value is unset, the library will default to TCP/IP and port 3306.
The mysqlnd library and the MySQL Client Library (libmysql) implement the same logic for determining defaults.
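The rules above can be restated as a small function. This is only a model of the description in this post, not the code of mysqlnd or libmysql, and the function name and array keys are made up for illustration. Note that the "." rule applies to Windows only; the sketch does not check the platform.

```php
<?php
/**
 * Model of the default resolution described above. This is NOT the client
 * library's code, only a hypothetical restatement of the rules in PHP.
 */
function resolve_connection_defaults(?string $host, ?int $port, ?string $socket): array
{
    if ($host === '.') {
        /* Windows named pipe: the socket parameter carries the pipe name */
        return array('transport' => 'pipe', 'pipe' => $socket ?: '\\\\.\\pipe\\MySQL');
    }
    if ($host === null || $host === '' || $host === 'localhost') {
        /* Unset host or "localhost": Unix domain socket */
        return array('transport' => 'socket', 'socket' => $socket ?: '/tmp/mysql.sock');
    }
    /* Everything else: TCP/IP, port defaults to 3306 */
    return array('transport' => 'tcp', 'host' => $host, 'port' => $port ?: 3306);
}

print_r(resolve_connection_defaults('localhost', null, null));
print_r(resolve_connection_defaults('127.0.0.1', null, null));
```

Keep in mind that the PHP configuration defaults shown earlier are applied by the extension first; the library built-in values only fill in what is still empty afterwards.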
Connection options
Various connection options are available, for example to set init commands which are executed upon connect, or to request the use of a certain charset. Connection options must be set before a network connection is established.
For setting a connection option the connect operation has to be performed in three steps: creating a connection handle with mysqli_init(), setting the requested options using mysqli_options() and establishing the network connection with mysqli_real_connect().
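The three steps could look like the following sketch. The function name, the timeout of five seconds and the init command are arbitrary choices for illustration; MYSQLI_OPT_CONNECT_TIMEOUT and MYSQLI_INIT_COMMAND are two of the options mysqli_options() accepts.

```php
<?php
/**
 * Open a connection in three steps so that options can be set before the
 * network connection exists. Hypothetical helper; the option values are
 * arbitrary examples.
 */
function connect_with_options(string $host, string $user, string $password, string $database): mysqli
{
    /* Step 1: create a connection handle, no network traffic yet */
    $mysqli = mysqli_init();
    if (!$mysqli) {
        throw new RuntimeException("mysqli_init() failed");
    }

    /* Step 2: set the requested options on the handle */
    if (!$mysqli->options(MYSQLI_OPT_CONNECT_TIMEOUT, 5) ||
        !$mysqli->options(MYSQLI_INIT_COMMAND, "SET time_zone = '+00:00'")) {
        throw new RuntimeException("Setting connection options failed");
    }

    /* Step 3: establish the network connection */
    if (!$mysqli->real_connect($host, $user, $password, $database)) {
        throw new RuntimeException(
            "Connect failed: (" . mysqli_connect_errno() . ") " . mysqli_connect_error());
    }

    return $mysqli;
}
```

A call such as `connect_with_options("127.0.0.1", "root", "", "test")` would then return a connected handle with both options in effect.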
Connection pooling
The mysqli extension supports persistent database connections, which are a special kind of pooled connections. By default every database connection opened by a script is either explicitly closed by the user during runtime or released automatically at the end of the script. A persistent connection is not. Instead it is put into a pool for later reuse, if a connection to the same server using the same username, password, socket, port and default database is used. Upon reuse connection overhead is saved.
Every PHP process uses its own mysqli connection pool. Depending on the web server deployment model, a PHP process may serve one or multiple requests. Therefore, a pooled connection may be used by one or more scripts in succession.
Persistent connections
If no unused persistent connection for a given combination of host, username, password, socket, port and default database can be found in the connection pool, mysqli opens a new connection. The use of persistent connections can be enabled and disabled using the PHP directive mysqli.allow_persistent. The total number of connections opened by a script can be limited with mysqli.max_links. The maximum number of persistent connections per PHP process can be restricted with mysqli.max_persistent. Please note that the web server may spawn many PHP processes.
A common complaint about persistent connections is that their state is not reset before reuse. For example, open, unfinished transactions are not automatically rolled back. Authorization changes that happened between putting the connection into the pool and reusing it are not reflected either. This may be seen as an unwanted side effect. On the other hand, the name persistent may be understood as a promise that the state is persisted.
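The text does not show how a persistent connection is requested. With mysqli this is done by prefixing the host name with p:. The following is a minimal sketch; both function names are made up for illustration.

```php
<?php
/**
 * Add the "p:" prefix that asks mysqli for a persistent connection,
 * unless the host name already carries it. Hypothetical helper.
 */
function persistent_host(string $host): string
{
    return (strpos($host, 'p:') === 0) ? $host : 'p:' . $host;
}

/**
 * Open a persistent connection. If the pool of the current PHP process
 * holds an unused connection for the same parameters it is reused,
 * otherwise a new connection is opened.
 */
function connect_persistent(string $host, string $user, string $password, string $database): mysqli
{
    return new mysqli(persistent_host($host), $user, $password, $database);
}
```

Whether persistent connections are available at all still depends on mysqli.allow_persistent and the limits mentioned above.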
The mysqli extension supports both interpretations of a persistent connection: state persisted and state reset before reuse. The default is reset. Before a persistent connection is reused, the mysqli extension implicitly calls mysqli_change_user() to reset the state. The persistent connection appears to the user as if it was just opened. No artefacts from previous usages are visible.
The mysqli_change_user() function is an expensive operation. For best performance, users may want to recompile the extension with the compile flag MYSQLI_NO_CHANGE_USER_ON_PCONNECT being set.
It is left to the user to choose between safe behaviour and best performance. Both are valid optimization goals. For ease of use, the safe behaviour has been made the default at the expense of maximum performance. Please run your own benchmarks to measure the performance impact for your workload.
Backups are just one of the many responsibilities of system administrators. IT Generalists have many areas to cover so they probably don’t take the time to make spreadsheets to measure the cost of data loss as they might in The Enterprise. However, investing time in trying to place a value on your backups can provide perspective on just what a terrific responsibility backups can be.
At Stack Exchange, I view our users and our user contributed content as the company’s most valuable asset. We have a lot of talent in the company, and our user contributed content isn’t even our direct source of revenue. However, if this data were totally lost (or a large portion of it) I have trouble envisioning how the company could bounce back from that. In addition to this, as a user myself I value this data as something we have created together that has intrinsic value for our professions.
Measuring Value
There are lots of ways to measure value. The obvious method is to use traditional business methods that put a dollar value on your company. When it comes to Stack Exchange some people somewhere put a big dollar value on the company which they call our valuation and in theory they don’t just make this up. If I accept that the loss of our user contributed content is the loss of the company, I could just say that our valuation is the value of our backups. The problem is that valuations tend to be pretty big numbers, and the abstraction there just doesn’t speak to me.
Also from a business perspective I can use the $18 million of VC funding we have taken and use that as a basis for value of our backups. That is a lot of money and I can’t help but start to feel the sense of importance of these backups. However, there is still a lot of abstraction there. The point of this exercise is to really feel the responsibility and not just be intellectually aware of it.
Another way to measure value is time. Our users and coworkers collectively have invested incredible amounts of time into our sites. I am a user and know many of our users, so I know that what we have created is important to us. I don’t have an accurate way to measure this, but I can do a back of the envelope calculation for Stack Overflow. To be conservative, looking only at the 1.4 million accepted answers on stackoverflow.com the total word count is about 100 million. According to Wikipedia people write about 19 words per minute, but I will assume people on SO are faster and can compose about 40 words per minute. That gives us 100,000,000 words / 40 words per minute / 60 minutes per hour / 24 hours a day / 365 days a year =~ 5 years of non-stop skilled work. Now I realize this calculation is perhaps a bit, well, harebrained, but it is reasonable for my purposes.
Another aspect to take into account is the profit generated by Stack Exchange. I don’t mean profit in the traditional sense, rather I look at what I call time profit. When a user answers someone’s question, they save time not only for that person but also for the many other people who will eventually search for the same question and find that answer. Because of this our sites like Stack Overflow are systems where the output is greater than the input. So in this sense of time profit, if our content was lost, future potential time profit would be lost.
We all have different ways of perceiving value. I value what our users and my coworkers have created, and when I attempt to measure just how much has been created, it becomes very apparent that safeguarding that creation through backups is an awesome responsibility.
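The same back of the envelope calculation, written out as a tiny script. The two input figures are the rough estimates from the paragraph above, not measured data, and the variable names are made up for illustration.

```php
<?php
/* Back of the envelope figures from the text; both are rough estimates */
$total_words      = 100000000; /* words in accepted answers */
$words_per_minute = 40;        /* assumed composition speed */

/* Minutes of writing, then converted to hours, days and years */
$minutes = $total_words / $words_per_minute;
$years   = $minutes / 60 / 24 / 365;

printf("%.2f years of non-stop writing\n", $years);
```

The script prints 4.76 years, which rounds to the 5 years quoted above.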
The mysqli quickstart series is coming to an end. Today, the post is about non-prepared statements. You may also want to check out the following related blog posts:
- Using MySQL prepared statements with PHP mysqli
- Using MySQL multiple statements with PHP mysqli
- Using MySQL stored procedures with PHP mysqli
Using mysqli to execute statements
Statements can be executed with the mysqli_query(), mysqli_real_query() and mysqli_multi_query() functions. The mysqli_query() function is the most commonly used one. It combines executing the statement and doing a buffered fetch of its result set, if any, in one call. Calling mysqli_query() is identical to calling mysqli_real_query() followed by mysqli_store_result().
The mysqli_multi_query() function is used with Multiple Statements and is described here.
$mysqli = new mysqli("example.com", "user", "password", "database");
if (!$mysqli->query("DROP TABLE IF EXISTS test") ||
!$mysqli->query("CREATE TABLE test(id INT)") ||
!$mysqli->query("INSERT INTO test(id) VALUES (1)"))
echo "Table creation failed: (" . $mysqli->errno . ") " . $mysqli->error;
Buffered result sets
After statement execution, results can be retrieved all at once to be buffered by the client, or read row by row. Client-side result set buffering allows the server to free resources associated with the statement results as early as possible. Generally speaking, clients are slow at consuming result sets. Therefore, it is recommended to use buffered result sets. mysqli_query() combines statement execution and result set buffering.
PHP applications can navigate freely through buffered results. Navigation is fast because the result sets are held in client memory. Please keep in mind that it is often easier to scale by client than it is to scale the server.
$mysqli = new mysqli("localhost", "root", "", "test");
if (!$mysqli->query("DROP TABLE IF EXISTS test") ||
!$mysqli->query("CREATE TABLE test(id INT)") ||
!$mysqli->query("INSERT INTO test(id) VALUES (1), (2), (3)"))
echo "Table creation failed: (" . $mysqli->errno . ") " . $mysqli->error;
$res = $mysqli->query("SELECT id FROM test ORDER BY id ASC");
echo "Reverse order...\n";
for ($row_no = $res->num_rows - 1; $row_no >= 0; $row_no--) {
$res->data_seek($row_no);
$row = $res->fetch_assoc();
echo " id = " . $row['id'] . "\n";
}
echo "Result set order...\n";
$res->data_seek(0);
while ($row = $res->fetch_assoc())
echo " id = " . $row['id'] . "\n";
Reverse order...
 id = 3
 id = 2
 id = 1
Result set order...
 id = 1
 id = 2
 id = 3
Unbuffered result sets
If client memory is a scarce resource and there is no need to free server resources as early as possible to keep server load low, unbuffered results can be used. Scrolling through unbuffered results is not possible before all rows have been read.
$mysqli->real_query("SELECT id FROM test ORDER BY id ASC");
$res = $mysqli->use_result();
echo "Result set order...\n";
while ($row = $res->fetch_assoc())
echo " id = " . $row['id'] . "\n";
Result set values data types
The mysqli_query(), mysqli_real_query() and mysqli_multi_query() functions are used to execute non-prepared statements. At the level of the MySQL Client Server Protocol, the command COM_QUERY and the text protocol are used for statement execution. With the text protocol, the MySQL server converts all data of a result set into strings before sending. This conversion is done regardless of the SQL result set column data type. The MySQL client libraries receive all column values as strings. No further client-side casting is done to convert columns back to their native types. Instead, all values are provided as PHP strings.
$mysqli = new mysqli("localhost", "root", "", "test");
if (!$mysqli->query("DROP TABLE IF EXISTS test") ||
!$mysqli->query("CREATE TABLE test(id INT, label CHAR(1))") ||
!$mysqli->query("INSERT INTO test(id, label) VALUES (1, 'a')"))
echo "Table creation failed: (" . $mysqli->errno . ") " . $mysqli->error;
$res = $mysqli->query("SELECT id, label FROM test WHERE id = 1");
$row = $res->fetch_assoc();
printf("id = %s (%s)\n", $row['id'], gettype($row['id']));
printf("label = %s (%s)\n", $row['label'], gettype($row['label']));
id = 1 (string)
label = a (string)
It is possible to convert integer and float columns back to PHP numbers by setting the MYSQLI_OPT_INT_AND_FLOAT_NATIVE connection option, if using the mysqlnd library. If set, the mysqlnd library will check the result set meta data column types and convert numeric SQL columns to PHP numbers, if the PHP data type value range allows for it. This way, for example, SQL INT columns are returned as integers.
$mysqli = mysqli_init();
$mysqli->options(MYSQLI_OPT_INT_AND_FLOAT_NATIVE, 1);
$mysqli->real_connect("localhost", "root", "", "test");
if (!$mysqli->query("DROP TABLE IF EXISTS test") ||
!$mysqli->query("CREATE TABLE test(id INT, label CHAR(1))") ||
!$mysqli->query("INSERT INTO test(id, label) VALUES (1, 'a')"))
echo "Table creation failed: (" . $mysqli->errno . ") " . $mysqli->error;
$res = $mysqli->query("SELECT id, label FROM test WHERE id = 1");
$row = $res->fetch_assoc();
printf("id = %s (%s)\n", $row['id'], gettype($row['id']));
printf("label = %s (%s)\n", $row['label'], gettype($row['label']));
id = 1 (integer)
label = a (string)
Happy hacking!
So at work we are finalizing the setup of a new server environment. The site is in PHP and the code is all in SVN. We were trying to decide what process to use to export the SVN contents to the site and that’s where I decided to learn how to write a bash script. This is my first and with some help from Jess we created the following script. The script does the following:
- Does an info on the remote repo to get the revision number
- Checks against local revision number which is stored in a file
- If the revision numbers don’t match, it does a diff on both revisions and creates a list with the files that were changed
- It then loops through each file and exports it to the site
- Finally it stores the new revision number in the file
# need to figure out what to do on files that need to be deleted
TARGET_DIR='/path/to/site'
REPO="svn://path.to.svn/repo"
REVISION_FILE='.revision'
echo "Getting info from remote repo"
# Take the number after "Revision:" rather than a fixed count of trailing characters
REMOTE_VERSION=$(svn info "$REPO" | awk '/^Revision:/ {print $2}')
CURRENT_VERSION=$(cat "$REVISION_FILE")
echo "Current Revision: $CURRENT_VERSION"
echo "Remote Revision: $REMOTE_VERSION"
if [ "$REMOTE_VERSION" -eq "$CURRENT_VERSION" ]
then
echo "No export needed"
exit 0
fi
echo "Getting diffs between revisions"
difflines=$(svn diff --summarize -r "$CURRENT_VERSION:$REMOTE_VERSION" "$REPO" 2>&1 | awk '{print $2}')
URL_LENGTH=${#REPO}
for i in $difflines; do
FILENAME=${i:$URL_LENGTH}
echo "svn export ${i} ${TARGET_DIR}${FILENAME}"
svn export "${i}" "${TARGET_DIR}${FILENAME}"
done
echo "Saving revision number"
echo ${REMOTE_VERSION} > $REVISION_FILE
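The script leaves open what to do with files that were deleted in the repo. One possible approach, sketched here in Python rather than bash, is to keep the status letter that `svn diff --summarize` prints in front of each URL and treat `D` entries separately. The function name `plan_sync` and the sample lines are made up for illustration; this is not part of the script above.

```python
def plan_sync(summary_lines, repo_url):
    """Split `svn diff --summarize` lines into paths to export and paths to delete.

    Each line is expected to look like "M       svn://host/repo/file.php",
    where the leading flags are A (added), M (modified) or D (deleted).
    """
    to_export, to_delete = [], []
    for line in summary_lines:
        if not line.strip():
            continue
        flags, url = line.split(None, 1)
        # Same idea as ${i:$URL_LENGTH} in the bash script: strip the repo prefix.
        relative_path = url.strip()[len(repo_url):]
        if flags.startswith("D"):
            to_delete.append(relative_path)
        else:
            to_export.append(relative_path)
    return to_export, to_delete


# Illustrative input only; real input would come from the svn command.
sample = [
    "M       svn://path.to.svn/repo/index.php",
    "A       svn://path.to.svn/repo/lib/new.php",
    "D       svn://path.to.svn/repo/old.php",
]
to_export, to_delete = plan_sync(sample, "svn://path.to.svn/repo")
print(to_export, to_delete)
```

The paths in `to_delete` could then be removed from the target directory instead of being passed to `svn export`.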
Elastic, fantastic: click here to add a MySQL replication database cluster to your cloud configuration. Click - yes, we can! Just one little thing, you need to update your application: consistency model changed. Plan for it. Some thoughts for PECL/mysqlnd_ms 1.x, the PHP mysqlnd replication plugin.
Problem: C as in ACID is no more
A MySQL replication cluster is eventually consistent. All writes are to be sent to the master. A write request is considered successful once the master has performed it.
[Diagram: Client 1 issues set(id = 1)]
Then, the master sends the update to the slave. Slave updates are asynchronous. For a short period, until all slaves have replicated the update, there is read inconsistency. The inconsistency window exists even with semisynchronous replication. Semisynchronous replication guarantees that the update information is stored on at least one slave, but there is no promise it’s already available at the SQL layer.
[Diagram: Client 2 issues get(id)]
Regarding the C in ACID… Transactions still exist. But as you can see, the C is no more when you take a step back to look at the cluster as a whole. Thus, application developers must plan for inconsistency.
Analysis: consistency levels
Any of the three consistency levels listed can be achieved with MySQL replication. Although, I must confess, strong consistency means using the master for all requests. That’s not what elastic-fantastic is about…
- Eventual consistency
  A node may not have the data, may serve a stale version, or may serve the current version. If no new updates happen, after the inconsistency window has passed, all nodes will eventually return the same current data. Default with MySQL replication and PECL/mysqlnd_ms 1.1.
- Session consistency
  Read-your-writes. One client is guaranteed to see its updates for the duration of a session. PECL/mysqlnd_ms 1.1 has master_on_write, which is a tiny step in this direction.
- Strong consistency
  All clients get a consistent view after a successful update.
Mistake: this is left to the developer
PECL/mysqlnd_ms 1.1 leaves it to the user to implement a choice, a certain service level. Developers have to use the SQL hints MYSQLND_MS_SLAVE_SWITCH, MYSQLND_MS_LAST_USED_SWITCH, MYSQLND_MS_MASTER_SWITCH …
Wrong, stop it, thinking from the past and for 1.very_low releases! As a cloud user, I do not want to have to bother about details of the MySQL cluster! As a cloud user, I want a ready-to-use service. As a cloud user, I want to define the service level I need and see the database cluster deliver - within its capabilities. "Cloud" is used as a synonym for our times here.
Proposing service levels
As a cloud user, I want to set the service level like this. I don’t care much how it’s implemented as long as I don’t have to fine-tune the setup.
$mysqli->query("SELECT product_id, title, price FROM products");
$mysqli->query("INSERT INTO shopping_cart(cart_id, product_id) VALUES (123, 456)");
mysqlnd_ms_set_service_level(MYSQLND_MS_SESSION_CONSISTENCY);
$mysqli->query("SELECT COUNT(*) FROM shopping_cart WHERE cart_id = 123");
Let’s see what service contracts PECL/mysqlnd_ms could fulfill with today’s MySQL replication cluster capabilities and how.
Client-side consistency level variations
Eventual consistency and session consistency could be parameterized.
- Eventual consistency
  - default - no parameter
    Service guaranteed: None. Data may be unavailable, stale or current.
  - max_age
    Service guaranteed: No data served which is more than max_age seconds old.
- Session consistency
  - default - no parameter
    Service guaranteed: read-your-writes for the duration of a web request.
  - session id
    Service guaranteed: read-your-writes in all sessions that share a session id.
- Strong consistency
  - default - no parameter
    Service guaranteed: consistent view on successful updates.
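To make the list above a bit more tangible, here is a rough sketch of how a client could map a service level to the set of nodes it may read from. It is written in Python purely for illustration. `ServiceLevel`, `Node`, `lag_seconds` and `candidate_nodes` are hypothetical names, not part of PECL/mysqlnd_ms, and the sketch assumes the replication lag of each slave is already known.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class ServiceLevel(Enum):
    EVENTUAL = "eventual"
    SESSION = "session"
    STRONG = "strong"


@dataclass
class Node:
    name: str
    is_master: bool
    lag_seconds: float = 0.0  # hypothetical: last measured replication lag


def candidate_nodes(nodes: List[Node], level: ServiceLevel,
                    max_age: Optional[float] = None) -> List[Node]:
    """Return the nodes that may serve a read at the requested service level."""
    if level is ServiceLevel.EVENTUAL:
        if max_age is None:
            # No guarantee requested: any node will do.
            return list(nodes)
        # Only the master and slaves that lag no more than max_age seconds.
        return [n for n in nodes if n.is_master or n.lag_seconds <= max_age]
    # Session and strong consistency: without further per-node information,
    # only the master is known to be safe.
    return [n for n in nodes if n.is_master]


cluster = [Node("master", True), Node("slave1", False, 1.0), Node("slave2", False, 5.0)]
print([n.name for n in candidate_nodes(cluster, ServiceLevel.EVENTUAL, max_age=2)])
```

Session consistency is treated like strong consistency here because the sketch has no notion of which slave has already replicated a given session's writes.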
If you compare this list with that of other players in the market, you’ll find amazing similarities…
Implementation considerations
The introduction of service levels into PECL/mysqlnd_ms - or at any other place on the client - will decrease performance. A client cannot ask a MySQL replication cluster for a list of slaves that lag no more than max_age seconds or have replicated a session id (or global transaction id) already with only one request. There is no central instance which can answer replication cluster node status requests.
Instead, a client has to contact the cluster nodes and check their status. Imagine 100 concurrent PHP clients of which each contacts all replication nodes to check status and forget their status at the end of the web request (remember - PHP default lifetime = web request). Hmm, that is 100 multiplied by … too much. Too much additional traffic, too many messages, too much latency.
Something in between a central server-side instance and having all clients check for themselves is needed. On the central-instance side, dedicated daemon processes, proxies, web server plugins or job queues come to mind. All of those could poll node status information either periodically or on demand. Status information could be shared between many web requests (= PHP and PECL/mysqlnd_ms runs). But who of us wants an extra piece in the stack? Sure, if your PHP MySQL server farm is huge, you already have a monitoring daemon running, or you are after SaaS/PaaS, it seems an acceptable option.
Or, data collection is not triggered externally and nodes report their status. This could be done with a MySQL server plugin. But then, where to send the information?
Joe-Dolittle could be happier if PECL/mysqlnd_ms copies the PHP session module approach to garbage collection. Garbage collection can be probability based. It is performed, on average, every n web requests. The query cache plugin has copied this strategy.
PECL/mysqlnd_ms could check node status every n web requests and cache information somewhere. Somewhere could be process memory, shared memory, memcache - whatever. This could and should help to reduce the number of node status requests from clients.
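The probability-based idea can be sketched in a few lines. This is an illustration in Python, not PECL/mysqlnd_ms code. `NodeStatusCache` and `fetch_status` are hypothetical names, and the cache lives in plain process memory, which is only one of the storage options mentioned above.

```python
import random


class NodeStatusCache:
    """Refresh cached node status on average once every n requests."""

    def __init__(self, fetch_status, every_n_requests, rng=random.random):
        self._fetch_status = fetch_status        # hypothetical callable that polls the nodes
        self._probability = 1.0 / every_n_requests
        self._rng = rng                          # injectable so the behavior can be tested
        self._status = None

    def get(self):
        # Always poll when nothing is cached yet; afterwards poll with
        # probability 1/n, the way PHP session garbage collection is triggered.
        if self._status is None or self._rng() < self._probability:
            self._status = self._fetch_status()
        return self._status


cache = NodeStatusCache(lambda: {"slave1": "lag 1s"}, every_n_requests=500)
print(cache.get())
```

Every web request would call `get()`, and only roughly one in n of them would pay for the extra status round trips.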
… just try it?!
I think we should add the idea of a service level to PECL/mysqlnd_ms regardless of potential performance issues. In the first step, we could always fall back to choosing the master. The master can fulfill all service levels. If nothing else, an API to set a service level makes our API better and cleaner. It is a reasonable step towards hiding cluster details from the user - just as it should be in modern times…
As we go, we can try to reduce the cases in which the master is queried, if need be.
Eventual consistency combined with a maximum age is a relatively soft service level. If, for example, max_age=2 and the system does 1000 req/s, an update of the cached slave lag list every 500 requests seems reasonable.
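A quick check of why every 500 requests seems reasonable for these numbers, using only the figures from the paragraph above (variable names are illustrative):

```python
max_age_seconds = 2
requests_per_second = 1000
refresh_every_n_requests = 500

# Average time between two updates of the cached slave lag list.
refresh_interval_seconds = refresh_every_n_requests / requests_per_second
# How many updates fit, on average, into one max_age window.
refreshes_per_max_age = max_age_seconds / refresh_interval_seconds

print(refresh_interval_seconds, refreshes_per_max_age)
```

The lag list would be refreshed about every half second, so on average four times within one max_age window.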
Choosing slaves for the most basic variant of session consistency (session bound to one web request, no user-supplied session id) could be doable by checking for global transaction ids. When it comes to user-supplied session ids, things are likely to get nasty and slow.
Thoughts?
At YUIConf 2011 last week, we set up a video camera and invited attendees to tell us how they use YUI and why they chose it for their projects. We were thrilled to hear their stories, and we’d love to hear yours as well! After checking out the video, leave a comment and tell us how you use YUI.
In #zf2 news, effective immediately, we no longer require a CLA for #zf2 contributions. Let the pull requests flow! – @weierophinney
Matthew Weier O’Phinney has announced that contributors to Zend Framework 2 no longer need to sign Zend’s Contributor License Agreement. Zend Framework 2 is developed using git and there’s a mirror on GitHub, which means that contributing to ZF2 is now just a pull request away!
During ZendCon this year, we released 2.0.0beta1 of Zend Framework. The key story in the release is the creation of a new MVC layer, and to sweeten the story, the addition of a modular application architecture.
"Modular? What's that mean?" For ZF2, "modular" means that your application is built of one or more "modules". In a lexicon agreed upon during our IRC meetings, a module is a collection of code and other files that solves a specific atomic problem of the application or website.
As an example, consider a typical corporate website in a technical arena. You might have:
- A home page
- Product and other marketing pages
- Some forums
- A corporate blog
- A knowledge base/FAQ area
- Contact forms
These can be divided into discrete modules:
- A "pages" module for the home page, product, and marketing pages
- A "forum" module
- A "blog" module
- An "faq" or "kb" module
- A "contact" module
Furthermore, if these are developed well and discretely, they can be re-used between different applications!
So, let's dive into ZF2 modules!
Posted by CraigBradford
If someone told you that there was a quick and easy way that many of you could improve your SERP CTR for minimal effort, you'd all stop in your tracks and give them full attention. Yet, Schema.org and rich snippets are still horribly under-utilized.
Since Google (and Bing!) officially introduced schema.org in June, it’s fair to say motivation to implement it has been mixed. However since its introduction Schema.org has already evolved a lot, adding a lot of new stuff that people haven’t paid attention to. Here I try to persuade you there are few downsides and plenty of upsides.
Myth: Schema.org markup doesn’t get rich snippets!
A common objection I hear from people not using Schema.org is that there’s no point because Google doesn’t use it for rich snippets. WRONG!
This was true, but is no longer; lots of websites in different markets have taken a leap of faith and are seeing the benefits in the form of rich snippets.
Examples of Schema.org Rich Snippets Showing in Google
The following are all examples of websites that are currently using the Schema.org vocabulary:
- E-Commerce
- TV Series
- Movies
- Events
- Recipe
As you can see Schema.org is definitely being used by Google.
Schema.org is not a language.
Schema.org is a Microdata vocabulary; not a language in and of itself. Let me explain the difference, as there is still a lot of confusion in the SEO community.
There are various languages that do the job we're discussing:
- Microformats
- Microdata
- RDFa
When marking up any content on a page for rich snippets or similar machine readable reasons, the method of doing so is always a mix between one of these and a vocabulary. See the example below of using Microdata with the schema.org vocabulary.
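As a rough illustration of what "machine readable" means here, the sketch below embeds a small Microdata snippet that uses the schema.org Event vocabulary and reads the properties back with Python's standard HTML parser. The event details are invented, and the `ItemPropCollector` class is a deliberately simplified assumption: it ignores nested items and is not how any search engine actually parses pages.

```python
from html.parser import HTMLParser

# Microdata (itemscope/itemtype/itemprop) is the language;
# schema.org/Event supplies the vocabulary. Event details are made up.
SNIPPET = """
<div itemscope itemtype="http://schema.org/Event">
  <span itemprop="name">Example Conference</span>
  <time itemprop="startDate" datetime="2011-11-15">15 November 2011</time>
  <span itemprop="location">London</span>
</div>
"""


class ItemPropCollector(HTMLParser):
    """Collect itemprop values from a single, flat Microdata item."""

    def __init__(self):
        super().__init__()
        self.item_type = None
        self.properties = {}
        self._current_prop = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            self.item_type = attrs.get("itemtype")
        prop = attrs.get("itemprop")
        if prop:
            if "datetime" in attrs:
                # <time> carries its machine-readable value in an attribute.
                self.properties[prop] = attrs["datetime"]
            else:
                self._current_prop = prop

    def handle_data(self, data):
        if self._current_prop and data.strip():
            self.properties[self._current_prop] = data.strip()
            self._current_prop = None


collector = ItemPropCollector()
collector.feed(SNIPPET)
print(collector.item_type, collector.properties)
```

The markup attributes belong to Microdata, while the type URL and the property names come from schema.org.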

Of the language and vocabulary above, it’s the vocabulary part that all the search engines have agreed to standardize with schema.org.
When Google originally announced that they were going to support the Schema.org vocabulary, they also dropped the bombshell that they supported only Microdata.

They also said that although they would continue to support the existing rich snippets markup, you should avoid mixing the formats together as it can confuse their parsers.

The fact that you couldn’t mix Schema.org and microformats or RDFa annoyed a lot of people and, as @TomAnthonySEO pointed out in his HTML5 blog post, Kavi Goel from Google later said this was a mistake and that they are fixing it. You can read the discussion here.
BREAKING NEWS: 2 days ago a pretty big announcement was made on the Schema.org blog. There are plans in the pipeline for the Schema.org vocabulary to be used with the RDFa language; with support for using other vocabularies on the same page.
5 underused Schema.org applications
I personally believe that Schema.org is the future and if you’ve not already done so, you should be implementing it right now. Regardless of what type of website you have, there are always ways you can use Schema.org, even if it’s simply defining an article and the publish date.
That being said, there are cases where I think you can gain even more by implementing it. Here are my top 5 examples of ways I think Schema.org should be used.
Events.
The event schema lets you get really specific about what type of event you are describing. Right now you can specify an event as any of the items shown in the image below.
With the recent QDF update, it’s important that you give Google as much information as possible. Events by their very nature are obviously time sensitive so using schema.org to enforce event details is obviously a good idea.
The events schema is a pretty comprehensive vocabulary; you are able to mark up things like attendees, duration, performers, location, and the start and end date. For more information see the events page: http://schema.org/Event.
Jobs
I don’t think I can describe how amazing this is. The jobs markup is a recent addition to the schema.org vocabulary and was announced last week on the schema blog. Even more amazing is what was announced on the Google blog today. Google have just launched a custom search engine that specifically looks for Schema.org job markup. The custom search engine is used to find veteran-committed job openings. You can read the blog post here.
I would love to see search related queries returning results like that in the example below.

Reputation Management
This isn’t groundbreaking so I’ll make it quick. Make use of the Person Schema to make the best page online about the person in question. Not only can you mark up the obvious things like name and age, you can use the tiny details such as what university they went to (alumniOf), what awards they have won (awards), where they work (worksFor), who their colleagues are (colleagues) and even who their family are (parents, siblings, spouse, relatedTo). This is an easy way to make a super targeted page around a single person. I tried this on my own blog and marked up as much as I could. (Disclaimer: I’ve not got round to actually writing a blog post yet, but you can see my about Craig page that I used as a schema test.) When you put this page into the Google rich snippets tool, look how much information they are now able to extract.

That is an amazing amount of information and is now obviously an awesome result to display when someone searches my name. This is how it would look in the SERPs as well.

News Sites
The recent QDF update reinforces how committed Google are to displaying fresh content where appropriate. Schema has now extended the vocabulary to include a section specifically for the news industry. This now allows you to reference a particular page or column in the physical paper edition if appropriate. The image below shows the recent additions.

News sites should be using this markup to tell the search engines what their content is about and when it was published.
E-commerce
I can’t believe how many e-commerce websites I see without any markup at all. People spend so much time trying to rank higher and forget to get the low-hanging fruit. Rich snippets are an amazing way to increase click-through rates by drawing attention to your listing. The eBay example shows how much the stars help make the listing stand out.

Wrap up
I hope I’ve managed to convince you that Schema.org is worth implementing right now.
There are already the benefits of rich snippets to be had, but this isn’t just about rich snippets; it’s about creating content that machines can understand and reference. There are already services that try to make use of this kind of information, such as Silk, Apple’s Siri and potentially Wiredoo. Ensuring that you are ahead of your competitors can only be a good thing.
For those of you that don’t know me, my name is Craig Bradford and I work at Distilled as an SEO consultant. If you have any questions please leave comments or ask me on twitter @CraigBradford
Posted by JonQ
This post was originally in YOUmoz, and was promoted to the main blog because it provides great value and interest to our community. The author's views are entirely his or her own and may not reflect the views of SEOmoz, Inc.
Often when a client arrives in need of links, it can be fairly daunting trying to figure out how the heck you’re going to get the link juice you need. Coming up with a structured plan that works is something I’ve been trying to improve over the last year or so, and I’m pretty sure it’s something I’ll be refining for many years to come. At the start of every campaign I’m involved with, I try to sit down and thrash out a load of ideas in an effort to come up with a link-building ‘road map’ to follow for the coming months.
I find having a solid plan useful in two ways. Firstly, for the client, I think it’s really good for them to understand what you are doing with the time they are paying for. In my experience, it really helps to sit down with them and say I’m going to be doing this much of ‘x’ and this much of ‘y’ because of ‘a, b, and c.' Being able to report back on this structured activity will definitely go down well with your clients. This open explanation of your plan creates a good transparent relationship with the client, and hopefully, one that will also stand the test of time.
Secondly, having a clear plan to follow is brilliant for me. Having a clear set of tasks allows me to manage my time much more effectively and ensures that I don’t fall behind on anything. Being freelance, I don’t have a boss to keep check on me so it’s vital that I keep track of what I’m working on and what I need to work on before the month is over. Sure, things are likely to change along the way, but it’s always useful to understand what you’re changing and why.
Here are some of the things I like to think about when coming up with a game plan.

1. Requirements – What Links Do I Need?
The first part of any solid link-building strategy should be trying to establish what links your client needs. Depending on the situation, you might either need to do a full backlink analysis or alternatively spend half-an-hour or so getting a quick ‘feel’ for things. Either way, it’s a step that shouldn’t be skipped.
For me, this falls into two stages:
Checking existing links – Using both Open Site Explorer (OSE) and Majestic SEO, I try to build up a good picture of what links are already coming into the website. How many linking root domains are there? What’s the anchor text balance like? How have they been building links in the past? All of these questions will go some way to determining what type of links I might want to prioritise.
For example, if your new client has a brand new domain with no links, then you will probably want to tread extremely carefully with your link-building. However, you might find out that your new client has gone way overboard with exact match anchor text, which will mean that balancing out the anchor text should become a focus for you.
Checking the competition – This is where things can start to get really interesting. Checking out the competition is a vital step in understanding what you might be going up against. Armed with that information, you can start to get an idea of what you might need to do in order to rank well, and how long it might take you to get there. Having some good insight into your competitor's link profile will also help you to track changes and understand shifts in the SERPS; all great information to be armed with!
I usually start this process off by tracking who’s ranking in the top ten for a variety of my main keywords. Once I’ve got a good idea of who’s hanging around, I’ll then download a full OSE report for the top ten results for each keyword. I can then look at numbers of linking root domains, anchor text spread, and many more things that will help determine what I might need to do. Justin Briggs wrote an amazing post on link analysis that goes into some great detail on the subject; I strongly suggest you read it!
Key Questions:
- How many linking root domains shall I aim for?
- What anchor text am I aiming for?
- How is the competition getting their links?
2. Timescales and Budgets
This is sadly one of the biggest factors that can affect your potential link-building strategy. It’s important to get a good idea of how much budget and time you will have available to you before you start thinking up a load of wonderful ways to build links. There’s no point in dreaming up ways to start promoting your amazing infographics and embeddable content if you don’t have any budget to create anything. That being said, there are always ways to build content and links for any budget (within reason of course!)
If the budget is tight, then it might be worth considering writing some great guides and resources to help establish your client as a trustworthy source of information. So long as you have the time to research and be creative, writing a good piece of link-worthy content shouldn’t have to cost the earth.
Key Questions:
- How much money do I have to spend on content?
- Do I have a budget for high-level directory submissions, press release distribution, etc?
- How much time can I give to this project?
3. Resources
By resources, I mean anything. Anything that you can draw on to help enhance what you’re doing. This is where being sociable, friendly, and a little bit persuasive can really help with your link-building. Do you have great designers you can call on? Do you know some fantastic writers? Does either you or the client have specialist knowledge that could be called on to create some useful resources? The word "resources" doesn’t just have to mean financial resources and number of staff; in my book, it means ANYTHING that could be useful in creating content, spreading the brand, and of course, gaining some juicy links.
Key Questions:
- Who do I know?
- Can the client get involved?
- How creative can we get?!

(Figuring out a plan getting to you? Just don’t end up like Crazy Harry... Photo credit)
4. Content – Post or Host?!
We all know that hosting great content on your website can help establish you as a great source of information, and hopefully start to bring in links naturally. So it’s definitely something you need to think about. But placing content on other websites is also a great way of building links, especially if you’re a new website trying to build a reputation from scratch.
Hosting content on your own site – Personally I see this as a must for any website. If your content and website sucks, then your success is going to be relatively limited. Writing great resources and promoting your own great content will help you build traffic, links, and social activity. However, can the website easily facilitate new content? Is your client willing to promote free content? These are a couple of things that could stand in your way, and working out how to get round them should definitely be planned for.
Posting content on other sites – If you’re working on a new website, then it might be some time until the links start to build up naturally. Going out and placing content on other websites is a fantastic way to build links and reputation. Using services like MyBlogGuest will help you to find some really good websites that are looking for content in your niche.
Key Questions:
- How much content can I/we create?
- Who’s going to be working on the content – me/client/third party?
- Where can I find a list of potential sites to post content on?
5. Specific Tasks
By now, you should be gathering a few pretty decent ideas together of where you might be headed with your link-building campaign. The real skill is turning all of this information into realistic tasks that can fit into the timescales the project allows. I think the key here is being ‘realistic.' Your strategy has to work for the project and give the client as much value as possible, but also not cause you to be overworked and underpaid.
I don’t think it’s very valuable to say ‘we’re going to make some link bait.' It’s far better to come up with specific tasks such as:
- Source a designer
- Gather a list of key industry figures/bloggers
- Release the content via your social network/paid discovery
- Track key metrics of the latest link bait
Going back to the point I made at the beginning, it’s always really useful to have a list of tasks to keep yourself in check and also to help feedback on progress to the client. Knowing what you need to do and when should help keep the wheels rolling. There are plenty of project management tools out there, but I tend to use a simple spreadsheet with a tab for ‘each task area.' Each tab can then contain specific month-by-month details of each task, with a detailed breakdown of the steps along the way:

Spending that little bit of extra time making some detailed plans should help you to work more efficiently and to keep focused throughout.
6. Don’t Fear Change
Whatever your plan includes, try not to worry about changing it along the way if you find that something isn’t working out as well as you might have hoped. It’s often the case that some things work out really well and produce more than what you expected, while other things simply never take off. Try to carry the mantra of ‘fail fast.’ If something’s not working out, then tweak, change, and tweak again until you hit that magic balance.
Having a detailed plan will mean that you can track everything you’re doing, so any changes you make will hopefully be well-informed.
As a last note, I thought I’d mention a few of the best resources I’ve read recently (SEOmoz and others) that have definitely helped shape the way I plan and research link building strategies. If you haven’t read these then go and do it now!
- Clockwork Pirate – Free link building EBook from Kelvin Newman
- Guide to Competitive Backlink Analysis – Justin Briggs
- Actionable Link-Building Strategies – Paddy Moogan
- Competitive Backlink Analysis – Jane Copland
- Effective Link Building - Justin Briggs Webinar
About me: I run my own SEO consulting business Go Search Marketing and have worked with a large variety of clients in different industries. I also have the pleasure of running my own ecommerce site The Jewellery Boutique. Feel free to come say hi and pop me any questions on Twitter @jonquinton1.
We’re pleased to announce the immediate availability of version 2.4.7 of the YUI Compressor. This version contains fixes to Compressor’s handling of CSS minification in a couple of core areas. It does not contain any JS Compression changes.
CSS minification
- Fixed data URL handling, so that large data URL values don’t crash or slow down CSS Compression.
- Fixed hex color value compression logic (#AABBCC -> #abc), so that the Compressor doesn’t inadvertently compress ID selectors (#AddressBook {…}).
- All Java CSS Compressor fixes have been ported to the JS Compressor.
- All fixes are backed up by unit tests.
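To illustrate the hex color fix described in the list above, here is a minimal Python sketch of the general idea: shorten six-digit colors only inside declaration blocks, so that anything in selector position (such as an ID selector) is left alone. This is not the YUI Compressor's actual implementation. The function name `shorten_hex_colors` and the regular expressions are assumptions made for illustration, and the sketch ignores edge cases such as strings, comments, and filter values.

```python
import re

# Matches #RRGGBB only when each channel is a doubled digit (e.g. #AABBCC)
# and no further hex digit follows. IGNORECASE also applies to backreferences.
HEX6 = re.compile(
    r"#([0-9a-f])\1([0-9a-f])\2([0-9a-f])\3(?![0-9a-f])",
    re.IGNORECASE,
)

# Matches an innermost { ... } declaration block (hypothetical simplification).
BLOCK = re.compile(r"\{[^{}]*\}")


def shorten_hex_colors(css):
    """Shorten #AABBCC to #abc inside declaration blocks only."""

    def shorten_block(block_match):
        # Only text between braces is rewritten; selectors stay untouched.
        return HEX6.sub(
            lambda c: "#" + (c.group(1) + c.group(2) + c.group(3)).lower(),
            block_match.group(0),
        )

    return BLOCK.sub(shorten_block, css)


print(shorten_hex_colors("#AABBCC { color: #AABBCC; }"))
```

In this sketch the leading `#AABBCC` is in selector position and is preserved, while the color value inside the braces is shortened.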
Links
YUI Compressor 2.4.7 is available for immediate download. Feel free to help us out by filing a bug or feature request, contributing code or tests or joining the conversation.
With YUIConf 2011 still fresh in our minds, we have now turned our attention to planning and early development on the next major release of YUI. We are targeting the following high-level goals for YUI 3.5.0:
- Formalization of Node.js as a first-class environment, including clean-up of stability and performance issues and CI integration
- Formal introduction of our second skin offering
- Refactoring of the Get utility for additional feature support and performance enhancements
- Refactoring of Loader to implement Get’s asynchronous functionality
- HTML5 implementation of the Uploader component
- Enhancements to App, Charts, and DataTable components
- Introduction of Button, Carousel, CSSButton, Dialog, TestConsole, Tooltip, and TreeView components
- Enhanced documentation of application development with YUI
We are planning a 3-sprint cycle of development for 3.5.0, with the following target milestones:
- PR1 in mid December 2011
- PR2 in late January 2012
- PR3 in mid March 2012
- GA release in late March 2012
The list of issues targeted for 3.5.0 PR1 and the full list of issues targeted for 3.5.0 are available in our bug tracker.
We look forward to getting great community feedback on our PRs!
Posted by randfish
On Tuesday, October 18th, Google announced they'd be hiding search referral data for logged-in Google searchers. When questioned by Danny Sullivan of SearchEngineLand, Google provided the following estimate on the impact to search referral data:
"Google software engineer Matt Cutts, who’s been involved with the privacy changes, wouldn’t give an exact figure but told me he estimated even at full roll-out, this would still be in the single-digit percentages of all Google searchers on Google.com"
Tragically, it appears that Cutts was either misinformed or gave misleading information, as "(not provided)" became a major referrer for many websites, climbing into double digits in early November. Now, that percentage has risen even higher, into the 20%+ range on many sites. Hubspot's Brian Whalley reported that the average website using their analytics lost 11.36% of keyword referral data and 423 sites lost more than 20% (15 unlucky souls lost 50%+, which seems almost crazy).
In an attempt to better quantify the impact, we ran a small survey last week, asking fellow marketers to supply information about the impact to their sites.
Here's a visualization of 60 sites' analytics data, showing the self-reported percent of their Google search traffic that used keyword "(not provided)":

Our average in the 6 days from Nov. 4-10 almost exactly matches the average of the several thousand Hubspot customers (12.02% for our survey vs. 11.36% for Hubspot), and thus makes me feel pretty good about that data from the survey-takers.
A little more about these 60 respondents:
- We collected 66 finished surveys, but scrubbed 6 that had missing, suspicious or improperly filled-out information
- The types of sites reporting data included a wide variety, as illustrated below:

- The sites included in the survey also included a solid variety of traffic numbers. The distribution below reports visits from Google organic search recorded in October, 2011:

- We asked the respondents what level of impact this change had on their content and marketing efforts, and received the following distribution of replies:

Approximately 1/5th of those surveyed reported no impact on their content/marketing efforts, which likely suggests those folks don't typically use keyword-level data to help them improve OR the change hasn't cost them enough data to have a negative impact. Another 1/5th claimed a strong impact, which is likely how I'd describe this change for our internal efforts. Granted, we don't actively use this data every week, but we've relied on it heavily for reporting and in the past for audits around content optimization and the generation of new content (or updating/refreshing of old material).
Here are the numbers and a visualization of the referrer encryption data specifically for SEOmoz.org:

From Oct. 19th - 30th, Google sent 163,909 visits from organic search to our website. 3,762 of those visits, or 2.3%, were via keyword "(not provided)". We didn't sweat this too much. As per Matt Cutts' promise, it was in single digits and, while frustrating, had a very tiny impact on our analytics, marketing and content optimization efforts.
But from Oct. 31st to November 13th, Google sent 191,726 visits and 35,168 of these, or 18.34%, came via keyword "(not provided)." This has a serious impact on our ability to make our website better for visitors (in particular, identifying keywords that send traffic but whose visitors may not be having a great experience, and for which we should be making new blog posts, videos, updates, etc. to help).
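For anyone who wants to run the same calculation on their own analytics, the percentages above can be reproduced with a few lines of Python. The helper name `not_provided_share` is just an illustrative choice; the visit counts are the SEOmoz.org figures quoted in this post.

```python
def not_provided_share(not_provided_visits, total_organic_visits):
    """Return the "(not provided)" share of organic visits as a percentage."""
    return round(100 * not_provided_visits / total_organic_visits, 2)


# Oct. 19th - 30th
early = not_provided_share(3762, 163909)
# Oct. 31st - Nov. 13th
late = not_provided_share(35168, 191726)

print(early, late)
```

Swapping in your own visit counts gives a figure you can compare directly against the survey averages.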
To me, that's the most tragic part of this change. The underselling of the change as being "single digits" was lame. The hypocrisy around keyword privacy sucks. And their motivations are questionable at best. But the crummiest part is the impact the change will have. It won't put any black hats out of business, won't stop any malware or hacking, and won't add a shred of value to the Internet. But it will make it harder for marketers and site builders to measure, understand and improve for their audience. The net impact will be a slightly worse web, and Google's claim of privacy will only protect them from criticism because it's a far easier explanation than the truth.
Sometimes, it sucks living in an ecosystem with an 800-pound gorilla.
p.s. Google's Matt Cutts responded to this post on Twitter today. I've included his comments and my replies below:

I remain somewhat skeptical that all the sites in Hubspot's data and ours would be outliers, but perhaps, at the least, this suggests the referral data disappearance won't get massively worse. Here's to hoping.
Three years ago in August 2009 we ran the first ever Young Rewired State – a hack weekend aimed at the young developer community. I was determined to try to engage them with the exciting (sic) world of open government data, and firing on all four cylinders went out to go tell those kids all about it.
But they were not there…
It made no sense to me that there was a thriving adult developer community, many of them of my own peer group, but no-one under the age of 18? Where were the kids? Was there a corner of the Internet I had yet to discover?
Over a period of months it became blindingly clear that there were no groups, there were tiny pockets and many isolated individuals – all teaching themselves how to code, driven by personal passion and nothing else.
We scraped together 50 of these kids from across the UK and it was one of the most incredible events we have ever run. Ask me about it and I will bore you to death with inspirational stories.
Since then, running Young Rewired State has become the most important thing I do.
One story that I have heard time and time again is that these genius kids are failing in ICT at school, because their teachers cannot mark their work. I mentioned this in the Guardian Tech Weekly Podcast and I am often asked to back up my claims!
One of the Young Rewired Staters who attended that first event (and every event Rewired State has run since regardless of the challenge – until he was snaffled by San Francisco: aged 16) explained this for the Coding for Kids google group, and I asked him if I could share his story here. Here goes:
When I was in year 10 (or 11, I can’t remember) we were given the brief to “design and create a multimedia product” for an assessment towards GCSE ICT. Most people opted to use PowerPoint to create a pseudo-multimedia product. I, however, decided to build a true multimedia product in Objective-C (a small game written for iPhone & iPod Touch which included a couple of videos, some story text, audio; it was an awesome little thing, it really was).
The PowerPoints passed with flying colors, my project failed. I asked the head of IT why he failed me, he told me he simply couldn’t mark it. He had installed the app on his iPhone, as had the rest of the IT staff (including the technicians who really loved it!), played it, but couldn’t mark it because a) he didn’t understand how it worked and b) it was leagues above anything else he’d ever seen from the class.
I argued the case and managed to scrape a pass by teaching him the basics of Objective-C from scratch and by commenting every single line of code I wrote to explain exactly what it did and how it did it (all 3,400 lines, including standard libraries I used) which ended up being a huge time sink. Time, I was constantly aware, I could be relaxing or working on a project of my own.
I understand that my case is a little different from the one involving Ruby, you can’t expect every IT teacher to be versed in iPhone development, but there is no excuse for not having at least a basic understanding of Ruby/Python and absolutely no excuse for failing work because it’s difficult to mark.
This NEEDS to be fixed, so many fantastic young devs are becoming disillusioned with education because of little things like this. The thought process, for me at least, follows: “Wait a second, my IT teacher can’t mark this, so it fails? I don’t really want to be part of a system that works like this”.
This is in stark contrast to events like YRS, where kids are encouraged to push the boundaries and explore how to do things differently to stunning effect. It was one of the major deciding factors for me to leave education and move to the US.
The frightening thing is, after bringing it up at an event, almost every other young dev had a similar story.
I cannot tell you how sad I am that we have not been able to keep this YRSer in the UK, and this is one of the very many stories that drives me.
What can you do to help? Start by understanding this problem, then join groups like Coding for Kids and CAS of course – sign the petition.
There are a great many people trying to help solve this problem, and 2012 is certainly going to see a huge push towards solving this, but for now, just take some time to understand why this is such an important fight we have to win – for this generation and the next.
And as a PS, please read the introduction to Douglas Rushkoff‘s book: Program or be programmed – it is very good! (I so should be on commission from this guy).
As you may have noticed recently, our engineers have been releasing a bunch of new products over the past few weeks. In addition to a vastly improved and cheaper option for adding CloudFlare SSL to domains, we've also made a few changes that may have been missed with the SSL changes.
What's new:
1. CloudFlare has updated the CloudFlare WordPress plugin to reflect new IP addresses we've added to the CloudFlare network. We're waiting on approval to get the new IP addresses updated in the CloudFlare Joomla extension as well.
2. CloudFlare added some new CloudFlare IP address ranges for both IPv4 and IPv6. Customers should make sure that they have updated mod_cloudflare to reflect these changes.
3. CloudFlare has recently launched a much improved version of CloudFlare invoices that you can find in your billing history.
4. CloudFlare's SSL now has three different SSL setting options.
5. A new CloudFlare datacenter in the United Kingdom.
6. Dome9 has been added to our growing list of app partners.
What's around the corner:
1. A new look to the overview page for 'My Websites'.
2. A new look to the CloudFlare 'Settings' page.
3. A new tier of billing service (stay tuned).
Have other suggestions about what we can do to improve the CloudFlare service? Send us a message with what you would like to see CloudFlare develop next.
So there I was, on stage in front of a large crowd, when Jason says "Has anyone seen the movie Apostle?" That's when I knew things were about to get interesting.
I was in Los Angeles on book tour a few weeks ago. The Los Angeles Lean Startup Circle arranged a spectacular event. I was interviewed - live on stage - by Jason Calacanis for This Week in Startups. The video became Episode #199, and you'll get to watch it below.
Jason's a controversial - and always entertaining - character. He's the founder and CEO of Mahalo, as well as the This Week In network. Oh and he also plays the occasional hand of televised high-stakes poker. So I really did not know what to expect when I met him on stage. For just about an hour, we had an in-depth interview, with Jason asking the kind of questions you only get from someone who has lived through the real highs and lows of entrepreneurship. I thought things were going well.
And then things took a pretty hilarious turn. Jason decides, on the spot, that we're going to have our very own revival meeting. In a full-on southern preacher accent, he invites entrepreneurs up on stage for some "hands on healing" as they share their real stories of problems, challenges, and obstacles in their startups. And, to my great surprise, people come forward. To be honest, I thought it was going to be a disaster, but I was wrong. The rest you have to watch for yourself.
Eric Ries of Lean Startup - TWiST #199
(The "Praise Jesus" starts at about 56 minutes in. Don't say I didn't warn you.)
I wanted to share this video with you, and not just because it is extremely entertaining. At the end of the session, I can tell that something is starting to bother Jason. We've been talking all along about pivots, vanity metrics, and validated learning. And I can see it start to dawn on him that, like the founders we've been "healing" all night, he has some questions about Mahalo that he wants answered.
And so we have a conversation, live on stage, about whether and how Mahalo should pivot from their current business (educational web videos) to a place where they're having unexpected success (paid iPad instructional video apps). A few days later, I noticed this in my newsfeed: Mahalo Lays Off 25 Percent for Shift to Apps From Video. And a few days after that, Mahalo got in touch to ask if I'd come into their studio to record a video instructional app based on The Lean Startup. After all that, how could I say no?
So if you'd like to see the next chapter in this story, you're cordially invited to a video shoot, which will take place next Monday, November 21, at Mahalo World HQ in Los Angeles. I'll be lecturing, we'll take questions from the audience, and - if anyone has the courage to come on stage - we'll even do some "hands-on healing" case studies with real entrepreneurs. Want to come? Sign up here.
See you Monday.
Posted by neilpatel
This post was originally in YOUmoz, and was promoted to the main blog because it provides great value and interest to our community. The author's views are entirely his or her own and may not reflect the views of SEOmoz, Inc.
Last month I talked about 7 lessons I learned from running an SEO agency. One of the things I’ve learned over the years is that you can command clients to pay you 6 and even 7 figures a year if your agency has brand recognition.
Brand recognition also gets you more inbound inquiries from companies who need SEO, which will help your revenue become more stable.
If you are ready to take your agency to the next level, here’s how you can build a big brand:
Start a blog
Not only can a blog help build your brand, it can also drive thousands of extra visitors to your website each month. Before you start an SEO blog, you need to know 1 important fact… it needs to be on the same domain as your agency website.
Early on I made the mistake of calling my SEO blog “Pronet Advertising”, when my SEO agency was in fact called “ACS”. The blog was hosted at pronetadvertising.com while the agency site was hosted at acsseo.com. Once the blog became popular people didn’t know whether we were called ACS or Pronet Advertising… even though on the blog it clearly stated that it was owned by ACS.
Once you have a blog up and running, there are a few things that you have to do if you want it to become popular. These days there are hundreds of SEO blogs and if you aren’t unique, there is no point in starting one.
- Write detailed blog posts – if you write how-to posts you’ll typically gain more traction than if you just blog about SEO news.
- Use stats – blog posts that contain facts and percentages also tend to do better. If you look at the SEOmoz blog, you’ll notice that posts like this are popular because they contain stats.
- Dumb things down – SEO is a complicated topic, so if you can dumb down your blog posts you’re better off than if you only use technical jargon. At the end of the day the person who is most likely to hire and pay you won’t understand this jargon, if they did they’d just optimize their own site. By using images and even whiteboards, you’ll be able to dumb down your blog posts, yet still keep them somewhat technical.
If you blog on a weekly basis and push out good content every week, you’ll start to get more traffic and see an increase in brand recognition in the SEO space.
Become a guest author
Blogging on your own site isn’t enough. You need to start blogging on other blogs. By becoming a guest author on popular SEO and marketing blogs you’ll brand yourself as an SEO expert, which will also help your agency.
In addition to blogging on marketing blogs you also want to start blogging in places that your ideal customer may be looking. For example if your typical customer is a business owner, you could guest post on Entrepreneur.com. If they are an up and coming startup, TechCrunch would be great place for you to blog.
Every week you should be guest posting on at least 2 blogs. The first should be a marketing/SEO blog and the second should be a blog that your ideal customer is reading. To achieve this goal you should email at least 10 people every Sunday night that you are interested in guest posting for.
The reason you want to email 10 people is that most bloggers will say “no”. However, if you let them know the specific topic you want to blog on and you do the research to make sure these blogs accept guest posts, you should get at least 20% of them saying “yes”.
When you submit your guest post make sure you include an image of yourself as well as a bio that includes a link to your company. If you forget the bio you won’t be building up your agency’s brand.
Attend conferences
Conferences are a great place to brand your agency and pick up new customers. When I first started out I attended all of the major SEO conferences, and boy did I make a lot of money doing so. People got to know me, I met potential customers and I even gained more knowledge.
Over time SEO conferences became competitive, which caused my brand to be mixed in with hundreds of other SEOs. After a while I rarely was able to pick up new clients from them… even if I was speaking.
I started to look for new conferences to brand my agency. I quickly realized that industry specific conferences and regional based events aren’t crowded with a ton of SEOs, which makes it easier to pick up new clients.
If you want to pick up new clients, start attending industry specific events that aren’t SEO related and also focus on attending regional based events.
If you want to start building your brand through conferences you have to attend dozens of them each year as some conferences will work out while others won’t. In most cases the events that cost a lot tend to provide a better ROI because if someone can afford to pay $3000 to attend a conference, they can probably afford to hire your firm.
Create white papers and beginner’s guides
Everyone these days is releasing free information because it really helps with branding. Just look at SEOmoz’s guide to SEO. It’s become the default guide if you want to learn about the subject.
It’s so popular that when people ask other SEOs about learning SEO, they just point them to SEOmoz’s guide.
By no means is this a short-term branding strategy, but if you can produce really good white papers and beginner’s guides you should start seeing them passed around. The trick is to just make them downloadable so that people can pass them around within their organization.
If you are going to create content, make sure you don’t just publish the same information that others have already released. For example it would be a waste of your time to reproduce the Beginner’s Guide to SEO.
Buy banner ads
Banner ads? They are a waste of money, right? Well, if you buy them on other SEO sites they aren’t. When I ran my agency we would spend up to $15,000 a month buying banner ads on other SEO sites, and these sites weren’t just news related like Sphinn.
We would actually buy ads on our competitors’ site. As weird as that may sound, other SEO bloggers, who are also service providers, sell ads. From the Wolf Howl blog to Search Engine Roundtable, you can buy ads on some of the most popular SEO sites on the web.
When I bought ads I found that most of them didn’t convert when I sent them to my homepage, but when I sent them to a landing page that gave away a free white paper, it was very effective, as your sales team can take those leads and close them.
Participate in the community
The cool part about the SEO industry is that SEOs have a tendency to work together. They play nice with each other and they are willing to share customers. For this reason it’s important to build up your brand amongst other SEOs.
Other than going to SEO networking events, you can do this through 2 simple ways:
- Comment on 5 different SEO blogs each day. Don’t leave generic comments; instead leave very detailed comments that are adding to the conversation.
- Join popular SEO forums like SEO Chat and respond to 5 different questions each day. Similar to leaving comments on blogs, make sure your responses on forums are detailed.
If you do these two things for a year, you’ll end up building a large Rolodex of SEOs.
Help popular bloggers out for free
The most effective branding strategy I used with my firm was to help out popular bloggers for free with their SEO. In exchange for helping them, they would add an image link with my company logo that stated “SEO by ACS”. When you do this for 30 of the Technorati 100 blogs like I did, you’ll quickly build up your brand as these blogs get millions of visitors each month.
You don’t have to know these blogs to do their SEO, just shoot them an email offering free SEO services in exchange for an image link. Trust me, it works.
And think of it this way, if someone has that popular of a blog, the chances are they know at least 1 company who can afford to pay you at least $10,000 a month for your consulting services. If you help these bloggers increase their traffic you should be able to get a few introductions to companies that can pay you.
Help people out for free
Another effective tactic that I used to build up my agency’s brand was to just help out people for free. You’ll notice that over time a lot of people and companies who can’t afford your services will hit you up. Instead of turning them down, give them some free advice every once in a while. You’ll be shocked at what that will do for your brand.
My first million-dollar customer came from someone who had no money. I helped him out for free for 30 minutes at a conference and he continually told everyone at the event how I was a great SEO. Within hours I had businesses hitting me up and a few of them were interested in engaging with my firm for 1.2 million dollars a year. Out of those few that were interested 1 agreed.
So the next time someone emails you asking for free advice, don’t push them away. Offer some free help, as they’ll constantly tell other people good things about you and your agency.
Create case studies
If you are good at what you do, you should be able to create some awesome case studies. All you have to do is break down what you specifically did for that company, how long it took, and the exact results they saw.
When you break down the results be sure to include numbers such as traffic stats, revenue increases or anything else the client will let you include. And when wrapping up the case study, make sure you get a good testimonial or video interview from the client.
One agency that does this very well (they aren’t an SEO agency) is Conversion Rate Experts. I found out about them through their SEOmoz case study. The video Rand did with them broke down how they made SEOmoz over a million dollars, and convinced me to hire them.
Showcase these case studies all over your website as they will help get your agency out there in front of bigger companies. Without them it’s going to be tough to gain interest from Fortune 500 companies.
Conclusion
If you do everything I mentioned above, you’ll brand your agency as one of the best shops around. I know it’s a lot of work, but in the end the revenue will make up for the hard work.
Give it a shot; try out just a few of the tactics above for the next 6 months and I guarantee that you’ll see results. You have to make sure you keep at it, as you can’t expect miracles in the first few months of doing this.
About the author: Neil Patel is the co-founder of KISSmetrics, an analytics provider that helps companies make better business decisions.
The United States House of Representatives is considering the Stop Online Piracy Act, known as SOPA. Companies including Google, Zynga, Facebook, Yahoo, AOL, and Mozilla, along with organizations like the Electronic Frontier Foundation (EFF), have been sharply critical of the bill. At CloudFlare, we share these concerns but see another significant risk: that SOPA's proposed restrictions could be used to launch a new form of denial of service attack against which I'm not sure we will be able to defend.
The Status Quo
There is no denying that the Internet creates new challenges for content creators. We see this first hand. CloudFlare's users are content creators. Every day they publish unique content and are deeply concerned when that content is used without their permission. We spend significant time building technologies, such as tools to prevent content scraping bots, in order to help publishers keep their content from being stolen.
At CloudFlare we also receive requests from content owners alleging one of our users has published their content without their permission. While CloudFlare is not a hosting provider, we do sit as a network provider in front of websites in order to make them faster and shield them from attack. The Digital Millennium Copyright Act, known as the DMCA, contemplates network providers like CloudFlare and generally outlines the procedures we take to reveal the actual host of a website when we are contacted by a copyright holder with a valid complaint.
Abusing the DMCA
We've been seeing a disturbing trend recently. Increasingly, the purported DMCA requests we receive, asking us to identify website hosts, actually come from attackers abusing the legal code. If we reveal the requested information, attacks are launched directly at those hosts, bypassing CloudFlare's protections and knocking legitimate sites offline. Initially, these requests were relatively easy to spot. When we recognized the new attack method, we changed our policies and trained our customer support team to more carefully screen DMCA requests. Increasingly, however, the requests are becoming more sophisticated and difficult to detect.
Imagine the challenge for someone on CloudFlare's support team. If someone writes to us alleging that they are a photographer who took a picture that appears on a website, or a designer who drew a logo, or an author who wrote some text, how can that claim be verified? I'm an attorney and member of the bar. I teach a course on intellectual property and technology law at the John Marshall Law School. I serve on the Board of the Center for Information Technology and Privacy Law. I've reviewed many of these requests and, even with my training in the subject, I have no idea how to effectively and efficiently tell the difference between valid and invalid complaints.
In an Internet without bad guys, the consequences of revealing a host's information are relatively minimal. Unfortunately, the Internet is full of bad guys. There has been a steady rise in attacks, increasingly affecting legitimate small businesses and ecommerce sites. These attacks have been part of why more than 100,000 websites have sought shelter behind CloudFlare in just the last 12 months. We offer great technical protections to shield sites from attack, but I'm concerned some of our efforts could be undermined by new laws like SOPA.
SOPA: Enabling a Purely Legal DDoS
CloudFlare's policy under the DMCA is to reveal information about the origin host when we receive a valid copyright complaint. If we make a mistake and reveal the origin host to a bad guy, then the bad guy still needs the technical acumen to launch a DDoS attack. What's concerning to me about SOPA is it could remove the technical requirement and effectively streamline DDoS attacks.
SOPA, as it is currently written, requires network service providers like CloudFlare to stop resolving DNS for sites that are alleged copyright violators. The allegation merely needs to include some reasonable evidence. In other words, a carefully crafted letter, or forged subpoena, could be all it takes for a future attacker to knock a site offline. No botnet needed, just a passable mastery of legalese.
While it is important to acknowledge the need for copyright protections online and to provide systems to protect content creators, new laws designed to uphold those protections need to be carefully crafted so as to not create substantial new security risks. Writing bad computer code has always provided a vector for attacks. I'm increasingly concerned that writing bad legal code, like SOPA, will provide a similar vector.
If you're in the US, follow this link to the EFF's site. From there, it takes less than a minute to send a message to your legislators to tell them SOPA is a bad idea.
Posted by Dr. Pete
“No one saw the panda uprising coming. One day, they were frolicking in our zoos. The next, they were frolicking in our entrails. They came for the identical twins first, then the gingers, and then the rest of us. I finally trapped one and asked him the question burning in all of our souls – 'Why?!' He just smiled and said ‘You humans all look alike to me.’”
- Sgt. Jericho “Bamboo” Jackson
Ok, maybe we’re starting to get a bit melodramatic about this whole Panda thing. While it’s true that Panda didn’t change everything about SEO, I think it has been a wake-up call about SEO issues we’ve been ignoring for too long.
One of those issues is duplicate content. While duplicate content as an SEO problem has been around for years, the way Google handles it has evolved dramatically and seems to only get more complicated with every update. Panda has upped the ante even more.
So, I thought it was a good time to cover the topic of duplicate content, as it stands in 2011, in depth. This is designed to be a comprehensive resource – a complete discussion of what duplicate content is, how it happens, how to diagnose it, and how to fix it. Maybe we’ll even round up a few rogue pandas along the way.
I. What Is Duplicate Content?
Let’s start with the basics. Duplicate content exists when any two (or more) pages share the same content. If you’re a visual learner, here’s an illustration for you:

Easy enough, right? So, why does such a simple concept cause so much difficulty? One problem is that people often make the mistake of thinking that a “page” is a file or document sitting on their web server. To a crawler (like Googlebot), a page is any unique URL it happens to find, usually through internal or external links. Especially on large, dynamic sites, creating two URLs that land on the same content is surprisingly easy (and often unintentional).
II. Why Do Duplicates Matter?
Duplicate content as an SEO issue was around long before the Panda update, and has taken many forms as the algorithm has changed. Here’s a brief look at some major issues with duplicate content over the years…
The Supplemental Index
In the early days of Google, just indexing the web was a massive computational challenge. To deal with this challenge, some pages that were seen as duplicates or just very low quality were stored in a secondary index called the “supplemental” index. These pages automatically became 2nd-class citizens, from an SEO perspective, and lost any competitive ranking ability.
Around late 2006, Google integrated supplemental results back into the main index, but those results were still often filtered out. You know you’ve hit filtered results anytime you see this warning at the bottom of a Google SERP:

Even though the index was unified, results were still “omitted”, with obvious consequences for SEO. Of course, in many cases, these pages really were duplicates or had very little search value, and the practical SEO impact was negligible, but not always.
The Crawl “Budget”
It’s always tough to talk limits when it comes to Google, because people want to hear an absolute number. There is no absolute crawl budget or fixed number of pages that Google will crawl on a site. There is, however, a point at which Google may give up crawling your site for a while, especially if you keep sending spiders down winding paths.
Although the “budget” isn’t absolute, even for a given site, you can get a sense of Google’s crawl allocation for your site in Google Webmaster Tools (under “Diagnostics” > “Crawl Stats”):

So, what happens when Google hits so many duplicate paths and pages that it gives up for the day? Practically, the pages you want indexed may not get crawled. At best, they probably won’t be crawled as often.
The Indexation “Cap”
Similarly, there’s no set “cap” to how many pages of a site Google will index. There does seem to be a dynamic limit, though, and that limit is relative to the authority of the site. If you fill up your index with useless, duplicate pages, you may push out more important, deeper pages. For example, if you load up on 1000s of internal search results, Google may not index all of your product pages. Many people make the mistake of thinking that more indexed pages is better. I’ve seen too many situations where the opposite was true. All else being equal, bloated indexes dilute your ranking ability.
The Penalty Debate
Long before Panda, a debate would erupt every few months over whether or not there was a duplicate content penalty. While these debates raised valid points, they often focused on semantics – whether or not duplicate content caused a Capital-P Penalty. While I think the conceptual difference between penalties and filters is important, the upshot for a site owner is often the same. If a page isn’t ranking (or even indexed) because of duplicate content, then you’ve got a problem, no matter what you call it.
The Panda Update
Since Panda (starting in February 2011), the impact of duplicate content has become much more severe in some cases. It used to be that duplicate content could only harm that content itself. If you had a duplicate, it might go supplemental or get filtered out. Usually, that was ok. In extreme cases, a large number of duplicates could bloat your index or cause crawl problems and start impacting other pages.
Panda made duplicate content part of a broader quality equation – now, a duplicate content problem can impact your entire site. If you’re hit by Panda, non-duplicate pages may lose ranking power, stop ranking altogether, or even fall out of the index. Duplicate content is no longer an isolated problem.
III. Three Kinds of Duplicates
Before we dive into examples of duplicate content and the tools for dealing with them, I’d like to cover 3 broad categories of duplicates. They are: (1) True Duplicates, (2) Near Duplicates, and (3) Cross-domain Duplicates. I’ll be referencing these 3 main types in the examples later in the post.
(1) True Duplicates
A true duplicate is any page that is 100% identical (in content) to another page. These pages only differ by the URL:

(2) Near Duplicates
A near duplicate differs from another page (or pages) by a very small amount – it could be a block of text, an image, or even the order of the content:

An exact definition of “near” is tough to pin down, but I’ll discuss some examples in detail later.
(3) Cross-domain Duplicates
A cross-domain duplicate occurs when two websites share the same piece of content:

These duplicates could be either “true” or “near” duplicates. Contrary to what some people believe, cross-domain duplicates can be a problem even for legitimate, syndicated content.
IV. Tools for Fixing Duplicates
This may seem out of order, but I want to discuss the tools for dealing with duplicates before I dive into specific examples. That way, I can recommend the appropriate tools to fix each example without confusing anyone.
(1) 404 (Not Found)
Of course, the simplest way to deal with duplicate content is to just remove it and return a 404 error. If the content really has no value to visitors or search, and if it has no significant inbound links or traffic, then total removal is a perfectly valid option.
(2) 301 Redirect
Another way to remove a page is via a 301-redirect. Unlike a 404, the 301 tells visitors (humans and bots) that the page has permanently moved to another location. Human visitors seamlessly arrive at the new page. From an SEO perspective, most of the inbound link authority is also passed to the new page. If your duplicate content has a clear canonical URL, and the duplicate has traffic or inbound links, then a 301-redirect may be a good option.
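To make the mechanics concrete, here's a minimal Python sketch of what a 301 response boils down to: a status line plus a Location header pointing at the canonical URL. The `REDIRECTS` table, the paths in it, and the `redirect_response` helper are all made up for illustration; in practice you'd usually configure redirects in your web server or CMS rather than hand-roll them.

```python
from http import HTTPStatus

# Hypothetical mapping of duplicate paths to their canonical URLs.
REDIRECTS = {
    "/old-page.html": "http://www.example.com/new-page/",
}


def redirect_response(path):
    """Return (status_line, headers) for a duplicate path, or None if no redirect applies."""
    target = REDIRECTS.get(path)
    if target is None:
        return None
    status = HTTPStatus.MOVED_PERMANENTLY  # 301: the move is permanent
    status_line = "{} {}".format(status.value, status.phrase)
    return status_line, [("Location", target)]
```

A path that isn't in the table returns `None`, which a caller would treat as "serve the page normally".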
(3) Robots.txt
Another option is to leave the duplicate content available for human visitors, but block it for search crawlers. The oldest and probably still easiest way to do this is with a robots.txt file (generally located in your root directory). It looks something like this:

One advantage of robots.txt is that it’s relatively easy to block entire folders or even URL parameters. The disadvantage is that it’s an extreme and sometimes unreliable solution. While robots.txt is effective for blocking uncrawled content, it’s not great for removing content already in the index. The major search engines also seem to frown on its overuse, and don’t generally recommend robots.txt for duplicate content.
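If you want to sanity-check a robots.txt file before you publish it, Python's standard library ships a parser. The sketch below uses a made-up file that blocks two hypothetical folders (`/print/` and `/search/`) for all crawlers. Keep in mind that this parser only does simple prefix matching, so it may not agree with how Google or Bing treat wildcard patterns.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: block two folders for every crawler.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
Disallow: /search/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_crawlable(url, agent="Googlebot"):
    """Report whether the rules above allow the given crawler to fetch the URL."""
    return parser.can_fetch(agent, url)
```

Here, `is_crawlable("http://www.example.com/print/product-1")` comes back `False`, while the regular product URL stays crawlable.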
(4) Meta Robots
You can also control the behavior of search bots at the page level, with a header-level directive known as the “Meta Robots” tag (or sometimes “Meta Noindex”). In its simplest form, the tag looks something like this:

This directive tells search bots not to index this particular page or follow links on it. Anecdotally, I find it a bit more SEO-friendly than Robots.txt, and because the tag can be created dynamically with code, it can often be more flexible.
The other common variant for Meta Robots is the content value “NOINDEX, FOLLOW”, which allows bots to crawl the paths on the page without adding the page to the search index. This can be useful for pages like internal search results, where you may want to block certain variations (I’ll discuss this more later) but still follow the paths to product pages.
One quick note: there is no need to ever add a Meta Robots tag with “INDEX, FOLLOW” to a page. All pages are indexed and followed by default (unless blocked by other means).
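Since the tag is usually generated by a template, here's a small Python sketch of a helper that builds it. The function name and its arguments are my own invention; the only things taken from the discussion above are the directive values and the rule that the default ("INDEX, FOLLOW") needs no tag at all.

```python
def meta_robots_tag(index=True, follow=True):
    """Build a Meta Robots tag, or an empty string when the defaults apply."""
    if index and follow:
        # Pages are indexed and followed by default, so no tag is needed.
        return ""
    directives = [
        "INDEX" if index else "NOINDEX",
        "FOLLOW" if follow else "NOFOLLOW",
    ]
    return '<meta name="ROBOTS" content="{}">'.format(", ".join(directives))
```

Calling `meta_robots_tag(index=False)` yields the "NOINDEX, FOLLOW" variant, and `meta_robots_tag(index=False, follow=False)` blocks both.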
(5) Rel=Canonical
In 2009, the search engines banded together to create the Rel=Canonical directive, sometimes called just “Rel-canonical” or the “Canonical Tag”. This allows webmasters to specify a canonical version for any page. The tag goes in the page header (like Meta Robots), and a simple example looks like this:

When search engines arrive on a page with a canonical tag, they attribute the page to the canonical URL, regardless of the URL they used to reach the page. So, for example, if a bot reached the above page using the URL “www.example.com/index.html”, the search engine would not index the additional, non-canonical URL. Typically, it seems that inbound link-juice is also passed through the canonical tag.
It’s important to note that you need to clearly understand what the proper canonical page is for any given website template. Canonicalizing your entire site to just one page or the wrong pages can be catastrophic.
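As a rough illustration, this Python sketch builds the tag for a given canonical URL. The helper name and the example URLs are assumptions; the one detail worth noticing is that the URL gets HTML-escaped, since a raw "&" in a parameterized URL doesn't belong in an attribute value.

```python
from html import escape


def canonical_link_tag(canonical_url):
    """Build a Rel=Canonical link tag for the page header."""
    # escape() turns "&" into "&amp;" so parameterized URLs stay valid HTML.
    return '<link rel="canonical" href="{}">'.format(escape(canonical_url, quote=True))
```

The hard part isn't building the tag. It's making sure each template passes in the right canonical URL for that page, rather than one URL for the whole site.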
(6) Google URL Removal
In Google Webmaster Tools (GWT), you can request that an individual page (or directory) be manually removed from the index. Click on “Site configuration” > “Crawler access”, and you’ll see a series of 3 tabs. Click on the 3rd tab, “Remove URL”, to get this:

Since this tool only removes one URL or path at a time and is completely at Google’s discretion, it’s usually a last-ditch approach to duplicate content. I just want to be thorough, though, and cover all of your options. An important technical note: you need to 404, Robots.txt block or Meta Noindex the page before requesting removal. Removal via GWT is primarily a last defense when Google is being stubborn.
Update: In the comments, Taylor pointed out that Google lifted the requirement that you have to first block the page to request removal. Removal requests can be done without blocking via other means now, but the removals only last 90 days.
(7) Google Parameter Blocking
You can also use GWT to specify URL parameters that you want Google to ignore (which essentially blocks indexation of pages with those parameters). If you click on “Site Configuration” > “URL parameters”, you’ll get a list something like this:

This list shows URL parameters that Google has detected, as well as the settings for how those parameters should be crawled. Keep in mind that the “Let Googlebot decide” setting doesn’t reflect other blocking tactics, like Robots.txt or Meta Robots. If you click on “Edit”, you’ll get the following options:

Google changed these recently, and I find the new version a bit confusing, but essentially “Yes” means the parameter is important and should be indexed, while “No” means the parameter indicates a duplicate. The GWT tool seems to be effective (and can be fast), but I don’t usually recommend it as a first line of defense. It won’t impact other search engines, and it can’t be read by SEO tools and monitoring software. It could also be modified by Google at any time.
(8) Bing URL Removal
Bing Webmaster Center (BWC) has tools very similar to GWT’s options above. Actually, I think the Bing parameter blocking tool came before Google’s version. To request a URL removal in Bing, click on the “Index” tab and then “Block URLs” > “Block URL and Cache”. You’ll get a pop-up like this:

BWC actually gives you a wider range of options, including blocking a directory and your entire site. Obviously, that last one usually isn’t a good idea.
(9) Bing Parameter Blocking
In the same section of BWC (“Index”), there’s an option called “URL Normalization”. The name implies Bing treats this more like canonicalization, but there’s only one option – “ignore”. Like Google, you get a list of auto-detected parameters and can add or modify them:

As with the GWT tools, I’d consider the Bing versions to be a last resort. Generally, I’d only use these tools if other methods have failed, and one search engine is just giving you grief.
(10) Rel=Prev & Rel=Next
Just this year (September 2011), Google gave us a new tool for fighting a particular form of near-duplicate content – paginated search results. I’ll describe the problem in more detail in the next section, but essentially paginated results are any searches where the results are broken up into chunks, with each chunk (say, 10 results) having its own page/URL.
You can now tell Google how paginated content connects by using a pair of tags much like Rel-Canonical. They’re called Rel-Prev and Rel-Next. Implementation is a bit tricky, but here’s a simple example:

In this example, the search bot has landed on page 3 of search results, so you need two tags: (1) a Rel-Prev pointing to page 2, and (2) a Rel-Next pointing to page 4. Where it gets tricky is that you’re almost always going to have to generate these tags dynamically, as your search results are probably driven by one template.
While initial results suggest these tags do work, they’re not currently honored by Bing, and we really don’t have much data on their effectiveness. I’ll briefly discuss other methods for dealing with paginated content in the next section.
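Because these tags have to be generated dynamically, a sketch of the logic may help. This Python example assumes a hypothetical search URL with no other query string and a `page` parameter; your own URL scheme will almost certainly differ. The first page gets only a Rel-Next tag and the last page only a Rel-Prev tag.

```python
from html import escape


def pagination_link_tags(base_url, page, total_pages):
    """Build Rel-Prev / Rel-Next tags for one page in a paginated series.

    Assumes base_url has no query string and pages are addressed
    with a hypothetical "page" parameter.
    """
    if not 1 <= page <= total_pages:
        raise ValueError("page must be between 1 and total_pages")

    def page_url(number):
        return escape("{}?page={}".format(base_url, number), quote=True)

    tags = []
    if page > 1:
        tags.append('<link rel="prev" href="{}">'.format(page_url(page - 1)))
    if page < total_pages:
        tags.append('<link rel="next" href="{}">'.format(page_url(page + 1)))
    return tags
```

For page 3 of 5, this returns a Rel-Prev tag pointing to page 2 and a Rel-Next tag pointing to page 4, matching the example described above.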
(11) Syndication-Source
In November of 2010, Google introduced a set of tags for publishers of syndicated content. The Meta Syndication-Source directive can be used to indicate the original source of a republished article, as follows:

Even Google’s own advice on when to use this tag and when to use a cross-domain canonical tag is a little bit unclear. Google launched this tag as “experimental”, and I’m not sure they’ve publicly announced a status change. It’s something to watch, but don’t rely on it.
Update (11/21/11): For even more confusion, Google has recently added the "standout" tag. This is supposed to be used when you break a news story, but the interplay between it and syndication-source is unclear. Again, I wouldn't rely on these tags for now. Thanks to SEO Workers for pointing this out in the comments.
(12) Internal Linking
It’s important to remember that your best tool for dealing with duplicate content is to not create it in the first place. Granted, that’s not always possible, but if you find yourself having to patch dozens of problems, you may need to re-examine your internal linking structure and site architecture.
When you do correct a duplication problem, such as with a 301-redirect or the canonical tag, it’s also important to make your other site cues reflect that change. It’s amazing how often I see someone set a 301 or canonical to one version of a page, and then continue to link internally to the non-canonical version and fill their XML sitemap with non-canonical URLs. Internal links are strong signals, and sending mixed signals will only cause you problems.
(13) Don’t Do Anything
Finally, you can let the search engines sort it out. This is what Google recommended you do for years, actually. Unfortunately, in my experience, especially for large sites, this is almost always a bad idea. It’s important to note, though, that not all duplicate content is a disaster, and Google certainly can filter some of it out without huge consequences. If you only have a few isolated duplicates floating around, leaving them alone is a perfectly valid option.
V. Examples of Duplicate Content
So, now that we’ve worked backwards and sorted out the tools for fixing duplicate content, what does it actually look like in the wild? I’m going to cover a wide range of examples that represent the issues you can expect on a real website. Throughout this section, I’ll reference the solutions listed in Section IV – for example, a reference to a 301-redirect will cite (IV-2).
(1) “www” vs. Non-www
For sitewide duplicate content, this is probably the biggest culprit. Whether you’ve got bad internal paths or have attracted links and social mentions to the wrong URL, you’ve got both the “www” version and non-www (root domain) version of your URLs indexed:

Most of the time, a 301-redirect (IV-2) is your best choice here. This is a common problem, and Google is good about honoring redirects for cases like these.
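The redirect rule itself is simple: keep everything about the URL except the hostname. Here's a Python sketch of that rule, assuming (purely for illustration) that "www.example.com" is the preferred host and "example.com" is the duplicate. In practice this usually lives in your web server configuration.

```python
from urllib.parse import urlsplit, urlunsplit

PREFERRED_HOST = "www.example.com"  # assumption: the "www" version is canonical
ALTERNATE_HOSTS = {"example.com"}   # assumption: hosts that should redirect


def www_redirect_target(url):
    """Return the URL to 301-redirect to, or None if the host is already canonical."""
    parts = urlsplit(url)
    if parts.netloc.lower() in ALTERNATE_HOSTS:
        # Swap only the host; path and query string are carried over unchanged.
        return urlunsplit(parts._replace(netloc=PREFERRED_HOST))
    return None
```

Carrying the path and query string over matters: redirecting every non-www URL to the home page would throw away the link authority of your deeper pages.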
You may also want to set your preferred address in Google Webmaster Tools. Under “Site Configuration” > “Settings”, you should see a section called “Preferred domain”:

There’s a quirk in GWT where, to set a preferred domain, you may have to create GWT profiles for both your “www” and non-www versions of the site. While this is annoying, it won’t cause any harm. If you’re having major canonicalization issues, I’d recommend it. If you’re not, then you can leave well enough alone and let Google determine the preferred domain.
(2) Staging Servers
While much less common than (1), this problem is often also caused by subdomains. In a typical scenario, you’re working on a new site design for a relaunch, your dev team sets up a subdomain with the new site, and they accidentally leave it open to crawlers. What you end up with is two sets of indexed URLs that look something like this:

Your best bet is to prevent this problem before it happens, by blocking the staging site with Robots.txt (IV-3). If you find your staging site indexed, though, you’ll probably need to 301-redirect (IV-2) those pages or Meta Noindex them (IV-4).
(3) Trailing Slashes ("/")
This is a problem people often have questions about, although it's less of an SEO issue than it once was. Technically, in the original HTTP protocol, a URL with a trailing slash and one without it were different URLs. Here's a simple example:

These days, almost all browsers automatically add the trailing slash behind the scenes and resolve both versions the same way. Matt Cutts did a recent video suggesting that Google automatically canonicalizes these URLs in "the vast majority of cases".
(4) Secure (https) Pages
If your site has secure pages (designated by the “https:” protocol), you may find that both secure and non-secure versions are getting indexed. This most frequently happens when navigation links from secure pages – like shopping cart pages – also end up secured, usually due to relative paths, creating variants like this:

Ideally, these problems are solved by the site-architecture itself. In many cases, it’s best to Noindex (IV-4) secure pages – shopping cart and check-out pages have no place in the search index. After the fact, though, your best option is a 301-redirect (IV-2). Be cautious with any sitewide solutions – if you 301-redirect all “https:” pages to their “http:” versions, you could end up removing security entirely. This is a tricky problem to solve and should be handled carefully.
(5) Home-page Duplicates
While problems (1)-(4) can all create home-page duplicates, the home-page has a couple of unique problems of its own. The most typical problem is that both the root domain and the actual home-page document name get indexed. For example:

Although this problem can be solved with a 301-redirect (IV-2), it’s often a good idea to put a canonical tag on your home-page (IV-5). Home pages are uniquely afflicted by duplicates, and a proactive canonical tag can prevent a lot of problems.
Of course, it’s important to also be consistent with your internal paths (IV-12). If you want the root version of the URL to be canonical, but then link to “/index.htm” in your navigation, you’re sending mixed signals to Google every time the crawlers visit.
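If you go the 301 route, the rule amounts to stripping the home-page document name off the URL. This Python sketch shows one way to express that; the list of document names is an assumption, so adjust it to whatever your platform actually serves.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit

# Assumption: document names your server treats as the default page.
INDEX_DOCUMENTS = {"index.htm", "index.html", "index.php", "default.aspx"}


def index_redirect_target(url):
    """Return the root-style URL to redirect to, or None if no document name is present."""
    parts = urlsplit(url)
    directory, filename = posixpath.split(parts.path)
    if filename.lower() not in INDEX_DOCUMENTS:
        return None
    # Make sure the remaining path ends in a single trailing slash.
    new_path = directory if directory.endswith("/") else directory + "/"
    return urlunsplit(parts._replace(path=new_path))
```

The same check is handy for auditing internal links: any link in your navigation that produces a redirect target here is one of the mixed signals described above.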
(6) Session IDs
Some websites (especially e-commerce platforms) tag each new visitor with a tracking parameter. On occasion, that parameter ends up in the URL and gets indexed, creating something like this:

That image really doesn’t do the problem justice, because in reality you can end up with a duplicate for every single session ID and page combination that gets indexed. Session IDs in the URL can easily add 1000s of duplicate pages to your index.
The best option, if possible on your site/platform, is to remove the session ID from the URL altogether and store it in a cookie. There are very few good reasons to create these URLs, and no reason to let bots crawl them. If that’s not feasible, implementing the canonical tag (IV-5) sitewide is a good bet. If you really get stuck, you can block the parameter in Google Webmaster Tools (IV-7) and Bing Webmaster Center (IV-9).
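Whichever fix you choose, you'll need a way to compute the clean version of a URL, either as the redirect target or as the href for a sitewide canonical tag. Here's a Python sketch that drops tracking parameters and keeps everything else. The parameter names (`sessionid`, `sid`) are hypothetical; use whatever your platform actually appends.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: names of the tracking parameters your platform adds.
TRACKING_PARAMS = {"sessionid", "sid"}


def strip_tracking_params(url, drop=TRACKING_PARAMS):
    """Return the URL with tracking parameters removed and all others kept in order."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in drop
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

The same approach should work for affiliate IDs and other tracking variables: add the parameter name to the set and the clean URL falls out.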
(7) Affiliate Tracking
This problem looks a lot like (6) and happens when sites provide a tracking variable to their affiliates. This variable is typically appended to landing page URLs, like so:

The damage is usually a bit less extreme than (6), but it can still cause large-scale duplication. The solutions are similar to session IDs. Ideally, you can capture the affiliate ID in a cookie and 301-redirect (IV-2) to the canonical version of the page. Otherwise, you’ll probably either need to use canonical tags (IV-5) or block the affiliate URL parameter.
(8) Duplicate Paths
Having duplicate paths to a page is perfectly fine, but when duplicate paths generate duplicate URLs, then you’ve got a problem. Let’s say a product page can be reached one of 3 ways:

Here, the iPad2 product page can be reached by 2 categories and a user-generated tag. User-generated tags are especially problematic, because they can theoretically spawn unlimited versions of a page.
Ideally, these path-based URLs shouldn’t be created at all. However a page is navigated to, it should only have one URL for SEO purposes. Some will argue that including navigation paths in the URL is a positive cue for site visitors, but even as someone with a usability background, I think the cons almost always outweigh the pros here.
If you already have variations indexed, then a 301-redirect (IV-2) or canonical tag (IV-5) are probably your best options. In many cases, implementing the canonical tag will be easier, since there may be too many variations to easily redirect. Long-term, though, you’ll need to re-evaluate your site architecture.
(9) Functional Parameters
Functional parameters are URL parameters that change a page slightly but have no value for search and are essentially duplicates. For example, let’s say that all of your product pages have a printable version, and that version has its own URL:

Here, the “print=1” URL variable indicates a printable version, which normally would have the same content but a modified template. Your best bet is to not index these at all, with something like a Meta Noindex (IV-4), but you could also use a canonical tag (IV-5) to consolidate these pages.
(10) International Duplicates
These duplicates occur when you have content for different countries which share the same language, all hosted on the same root domain (it could be subfolders or subdomains). For example, you may have an English version of your product pages for the US, UK, and Australia:

Unfortunately, this one’s a bit tough – in some cases, Google will handle it perfectly well and rank the appropriate content in the appropriate countries. In other cases, even with proper geo-targeting, they won’t. It’s often better to target the language itself than the country, but there are legitimate reasons to split off country-specific content, such as pricing.
If your international content does get treated as duplicate content, there’s no easy answer. If you 301-redirect, you lose the page for visitors. If you use the canonical tag, then Google will only rank one version of the page. The “right” solution can be highly situational and really depends on the risk-reward tradeoff (and the scope of the filter/penalty).
(11) Search Sorts
So far, all of the examples I’ve given have been true duplicates. I’d like to dive into a few examples of “near” duplicates, since that concept is a bit fuzzy. A few common examples pop up with internal search engines, which tend to spin off many variants – sortable results, filters, and paginated results being the most frequent problems.
Search sort duplicates pop up whenever a sort (ascending/descending) creates a separate URL. While the two sorted results are technically different pages, they add no additional value to the search index and contain the same content, just in a different order. URLs might look like:

In most cases, it’s best just to block the sortable versions completely, usually by adding a Meta Noindex (IV-4) selectively to pages called with that parameter. In a pinch, you could block the sort parameter in Google Webmaster Tools (IV-7) and Bing Webmaster Center (IV-9).
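"Selectively" just means the template checks the incoming URL before deciding whether to print the tag. A minimal Python sketch of that check follows, assuming a hypothetical `sort` parameter. The same pattern could cover the printable versions in (9) by adding their parameter to the set.

```python
from urllib.parse import parse_qs, urlsplit

# Assumption: query parameters that only re-order or re-style existing content.
NOINDEX_PARAMS = {"sort"}


def should_noindex(url):
    """Return True when the URL carries a parameter that marks it as a near duplicate."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return any(name.lower() in NOINDEX_PARAMS for name in query)
```

When this returns `True`, the page template would emit the Meta Noindex tag; otherwise it prints nothing and the page is indexed as usual.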
(12) Search Filters
Search filters are used to narrow an internal search – it could be price, color, features, etc. Filters are very common on e-commerce sites that sell a wide variety of products. Search filter URLs look a lot like search sorts, in many cases:

The solution here is similar to (11) – don’t index the filters. As long as Google has a clear path to products, indexing every variant usually causes more harm than good.
(13) Search Pagination
Pagination is an easy problem to describe and an incredibly difficult one to solve. Any time you split internal search results into separate pages, you have paginated content. The URLs are easy enough to visualize:

Of course, over 100s of results, one search can easily spin out dozens of near duplicates. While the results themselves differ, many important features of the pages (Titles, Meta Descriptions, Headers, copy, template, etc.) are identical. Add to that the problem that Google isn’t a big fan of “search within search” (having their search pages land on yours).
In the past, Google has said to let them sort pagination out – problem is, they haven’t done it very well. Recently, Google introduced Rel=Prev and Rel=Next (IV-10). Initial data suggests these tags work, but we don’t have much data, they’re difficult to implement, and Bing doesn’t currently support them.
You have 3 other, viable options (in my opinion), although how and when they’re viable depends a lot on the situation:
- You can Meta Noindex,Follow pages 2+ of search results. Let Google crawl the paginated content but don’t let them index it.
- You can create a “View All” page that links to all search results at one URL, and let Google auto-detect it. This seems to be Google’s other preferred option.
- You can create a “View All” page and set the canonical tag of paginated results back to that page. This is unofficially endorsed, but the pages aren’t really duplicates in the traditional sense, so some claim it violates the intent of Rel-canonical.
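The first option is the easiest of the three to automate. As a rough Python sketch (the function name is hypothetical), the search template only needs to know which page of results it's rendering:

```python
def pagination_meta_robots(page):
    """Return the Meta Robots tag for a page of search results (pages are numbered from 1)."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    if page == 1:
        # Leave page 1 indexable: no tag needed, since indexing is the default.
        return ""
    # Pages 2+ can be crawled for their links but stay out of the index.
    return '<meta name="ROBOTS" content="NOINDEX, FOLLOW">'
```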
Adam Audette has a recent, in-depth discussion of search pagination that I highly recommend. Pagination for SEO is a very difficult topic and well beyond the scope of this post.
(14) Product Variations
Product variant pages are pages that branch off from the main product page and only differ by one feature or option. For example, you might have a page for each color a product comes in:

It can be tempting to want to index every color variation, hoping it pops up in search results, but in most cases I think the cons outweigh the pros. If you have a handful of product variations and are talking about dozens of pages, fine. If product variations spin out into 100s or 1000s, though, it’s best to consolidate. Although these pages aren’t technically true duplicates, I think it’s ok to Rel-canonical (IV-5) the options back up to the main product page.
One side note: I purposely used “static” URLs in this example to demonstrate a point. Just because a URL doesn’t have parameters, that doesn’t make it immune to duplication. Static URLs (parameter-free) may look prettier, but they can be duplicates just as easily as dynamic URLs.
(15) Geo-keyword Variations
Once upon a time, “local SEO” meant just copying all of your pages 100s of times, adding a city name to the URL, and swapping out that city in the page copy. It created URLs like these:

In 2011, not only is local SEO a lot more sophisticated, but these pages are almost always going to look like near-duplicates. If you have any chance of ranking, you’re going to need to invest in legitimate, unique content for every geographic region you spin out. If you aren’t willing to make that investment, then don’t create the pages. They’ll probably backfire.
(16) Other “Thin” Content
This isn’t really an example, but I wanted to stop and explain a word we throw around a lot when it comes to content: “thin”. While thin content can mean a variety of things, I think many examples of thin content are near-duplicates like (14) above. Whenever you have pages that vary by only a tiny percentage of content, you risk those pages looking low-value to Google. If those pages are heavy on ads (with more ads than unique content), you’re at even more risk. When too much of your site is thin, it’s time to revisit your content strategy.
(17) Syndicated Content
These last 3 examples all relate to cross-domain content. Here, the URLs don’t really matter – they could be wildly different. Examples (17) and (18) only differ by intent. Syndicated content is any content you use with permission from another site. However you retrieve and integrate it, that content is available on another site (and, often, many sites).
While syndication is legitimate, it’s still likely that one or more copies will get filtered out of search results. You could roll the dice and see what happens (IV-13), but conventional SEO wisdom says that you should link back to the source and probably set up a cross-domain canonical tag (IV-5). A cross-domain canonical looks just like a regular canonical, but with a reference to someone else’s domain.
Of course, a cross-domain canonical tag means that, assuming Google honors the tag, your page won’t get indexed or rank. In some cases, that’s fine – you’re using the content for its value to visitors. Practically, I think it depends on the scope. If you occasionally syndicate content to beef up your own offerings but also have plenty of unique material, then link back and leave it alone. If a larger part of your site is syndicated content, then you could find yourself running into trouble. Unfortunately, using the canonical tag (IV-5) means you'll lose the ranking ability of that content, but it could keep you from getting penalized or having Panda-related problems.
(18) Scraped Content
Scraped content is just like syndicated content, except that you didn’t ask permission (and might even be breaking the law). The best solution: QUIT BREAKING THE LAW!
Seriously, no de-duping solution is going to satisfy the scrapers among you, because most solutions will knock your content out of ranking contention. The best you can do is pad the scraped content with as much of your own, unique content as possible.
(19) Cross-ccTLD Duplicates
Finally, it’s possible to run into trouble when you copy same-language content across countries – see example (9) above – even with separate Top-Level Domains (TLDs). Fortunately, this problem is fairly rare, but we see it with English-language content and even with some European languages. For example, I frequently see questions about Dutch content on Dutch and Belgian domains ranking improperly.
Unfortunately, there’s no easy answer here, and most of the solutions aren’t traditional duplicate-content approaches. In most cases, you need to work on your targeting factors and clearly show Google that the domain is tied to the country in question.
VI. Which URL Is Canonical?
I’d like to take a quick detour to discuss an important question – whether you use a 301-redirect or a canonical tag, how do you know which URL is actually canonical? I often see people making a mistake like this:

The problem is that “product.php” is just a template – you’ve now collapsed all of your products down to a single page (that probably doesn’t even display a product). In this case, the canonical version probably includes a parameter, like “id=1234”.
The canonical page isn’t always the simplest version of the URL – it’s the simplest version of the URL that generates UNIQUE content. Let’s say you have these 3 URLs that all generate the same product page:

Two of these versions are essentially duplicates, and the “print” and “session” parameters represent variations on the main product page that should be de-duped. The “id” parameter is essential to the content, though – it determines which product is actually being displayed.
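As a rough sketch of that rule, the function below keeps only the parameters that determine the content and drops everything else. The domain `www.example.com` and the `ESSENTIAL_PARAMS` set are assumptions for illustration; on a real site you would have to decide which parameters are essential.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: only "id" changes which product is displayed.
ESSENTIAL_PARAMS = {"id"}


def canonical_url(url: str) -> str:
    """Drop non-essential query parameters (e.g. print, session) from a URL."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in ESSENTIAL_PARAMS
    ]
    # Sort so that parameter order never creates a second "canonical" form.
    kept.sort()
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


if __name__ == "__main__":
    for candidate in (
        "http://www.example.com/product.php?id=1234",
        "http://www.example.com/product.php?id=1234&print=1",
        "http://www.example.com/product.php?id=1234&session=abc",
    ):
        print(canonical_url(candidate))
```

Note that the function never collapses to the bare template URL as long as the "id" parameter is present, which is exactly the mistake described above.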
So, consider yourself warned. As much trouble as rampant duplicates can be, bad canonicalization can cause even more damage in some cases. Plan carefully, and make absolutely sure you select the correct canonical versions of your pages before consolidating them.
VII. Tools for Diagnosing Duplicates
So, now that you recognize what duplicate content looks like, how do you go about finding it on your own site? Here are a few tools to get you started – I won’t claim it’s a complete list, but it covers the bases:
(1) Google Webmaster Tools
In Google Webmaster Tools, you can pull up a list of duplicate TITLE tags and Meta Descriptions Google has crawled. While these don’t tell the whole story, they’re a good starting point. Many URL-based duplicates will naturally generate identical Meta data. In your GWT account, go to “Diagnostics” > “HTML Suggestions”, and you’ll see a table like this:

You can click on “Duplicate meta descriptions” and “Duplicate title tags” to pull up a list of the duplicates. This is a great first stop for finding your trouble-spots.
(2) Google’s Site: Command
When you already have a sense of where you might be running into trouble and need to take a deeper dive, Google’s “site:” command is a very powerful and flexible tool. What really makes “site:” powerful is that you can use it in conjunction with other search operators.
Let’s say, for example, that you’re worried about home-page duplicates. To find out if Google has indexed any copies of your home-page, you could use the “site:” command with the “intitle:” operator, like this:

Put the title in quotes to capture the full phrase, and always use the root domain (leave off “www”) when making a wide sweep for duplicate content. This will detect both “www” and non-www versions.
Another powerful combination is “site:” plus the “inurl:” operator. You could use this to detect parameters, such as the search-sort problem mentioned above:

The “inurl:” operator can also detect the protocol used, which is handy for finding out whether any secure (https:) copies of your pages have been indexed:

You can also combine the “site:” operator with regular search text, to find near-duplicates (such as blocks of repeated content). To search for a block of content across your site, just include it in quotes:

I should also mention that searching for a unique block of content in quotes is a cheap and easy way to find out if people have been scraping your site. Just leave off the “site:” operator and search for a long or unique block entirely in quotes.
Of course, these are just a few examples, but if you really need to dig deep, these simple tools can be used in powerful ways. Ultimately, the best way to tell if you have a duplicate content problem is to see what Google sees.
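If you run these checks often, it can help to build the query strings in code. This is just a convenience sketch: `site_query` is a hypothetical helper, `example.com` is a placeholder domain, and the output is meant to be pasted into Google by hand.

```python
from typing import Optional


def site_query(domain: str,
               intitle: Optional[str] = None,
               inurl: Optional[str] = None,
               phrase: Optional[str] = None) -> str:
    """Combine the site: operator with intitle:, inurl: or a quoted phrase."""
    # Use the root domain so both www and non-www copies are covered.
    if domain.startswith("www."):
        domain = domain[len("www."):]
    terms = [f"site:{domain}"]
    if intitle:
        # Quotes capture the full title phrase.
        terms.append(f'intitle:"{intitle}"')
    if inurl:
        terms.append(f"inurl:{inurl}")
    if phrase:
        terms.append(f'"{phrase}"')
    return " ".join(terms)


if __name__ == "__main__":
    print(site_query("www.example.com", intitle="Your Home Page Title"))
    print(site_query("example.com", inurl="sort="))
    print(site_query("example.com", inurl="https"))
    print(site_query("example.com", phrase="a long block of repeated copy"))
```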
(3) SEOmoz Campaign Manager
If you’re an SEOmoz PRO member, you have access to some additional tools for spotting duplicates in your Campaigns. In addition to duplicate page titles, the Campaign manager will detect duplicate content on the pages themselves. You can see duplicate pages we’ve detected from the Campaign Overview screen:

Click on the “Duplicate Page Content” link and you’ll not only see a list of potential duplicates, but you’ll get a graph of how your duplicate count has changed over time:

The historical graph can be very useful for determining if any recent changes you’ve made have created (or resolved) duplicate content issues.
Just a technical note, since it comes up a lot in Q&A – Our system currently uses a threshold of 95% to determine whether content is duplicated. This is based on the source code (not the text copy), so the amount of actual duplicate content may vary depending on the code/content ratio.
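To get a feel for why the code/content ratio matters, here is a small sketch of a threshold check on source code. It is not our crawler's algorithm: it uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure, and the sample pages are made up.

```python
from difflib import SequenceMatcher

DUPLICATE_THRESHOLD = 0.95  # the 95% threshold mentioned above


def source_similarity(source_a: str, source_b: str) -> float:
    """Return a 0..1 similarity score for two pages' source code."""
    # autojunk=False keeps long pages from being scored by the junk heuristic.
    return SequenceMatcher(None, source_a, source_b, autojunk=False).ratio()


def is_duplicate(source_a: str, source_b: str) -> bool:
    return source_similarity(source_a, source_b) >= DUPLICATE_THRESHOLD


if __name__ == "__main__":
    # A heavy shared template with only a few characters of unique copy.
    template = "<div class='nav'>menu</div>" * 40
    page_red = template + "<p>red</p>"
    page_blue = template + "<p>blue</p>"
    print(is_duplicate(page_red, page_blue))
```

The two sample pages have different copy, but because the shared template dominates the source, they still land above the threshold.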
(4) Your Own Brain
Finally, it’s important to remember to use your own brain. Finding duplicate content often requires some detective work, and over-relying on tools can leave some gaps in what you find. One critical step is to systematically navigate your site to find where duplicates are being created. For example, does your internal search have sorts and filters? Do those sorts and filters get translated into URL variables, and are they crawlable? If they are, you can use the “site:” command to dig deeper. Even finding a handful of trouble spots using your own sleuthing skills can end up revealing 1000s of duplicate pages, in my experience.
I Hope That Covers It
If you’ve made it this far: congratulations – you’re probably as exhausted as I am. I hope that covers everything you’d want to know about the state of duplicate content in 2011, but if not, I’d be happy to answer questions in the comments. Dissenting opinions are welcome, too. Some of these topics, like pagination, are extremely tricky in practice, and there’s often not one “right” answer. Finally, if you liked my panda mini-poster, here’s a link to a larger version of Pandas Take No Prisoners.
Update: Post-publication, a handful of people requested a stand-alone PDF version of the post. You can download it here (22 pages, 560KB).
The catchy theme/motto of the PECL/mysqlnd_ms 1.2 release will be Global Transaction ID support. Hidden behind the buzzword are two features. We will allow users to request a certain level of service from the replication cluster (keyword: consistency) and we will do basic global transaction ID injection to help with master failover. Failover refers to the procedure of electing a new master in case of a master failure.
Global Transaction ID support is the 1.2 motto/theme
The two features are somewhat related, thus the theme. In very basic terms, the idea of a global transaction ID is to have a sequential number in a table on the master. Whenever a client inserts data, the ID/counter gets incremented. The table is replicated to the slaves. If the master fails, the database administrator checks the slaves to find the one with the highest global transaction ID. Please find details, for example, in Wonders of Global Transaction ID injection.
What the plugin will do is inject a user-provided SQL statement with every transaction to increment the global transaction counter.
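The failover check described above boils down to picking the slave with the highest replicated counter. A minimal sketch in Python, assuming the administrator has already read the counter from each slave (the slave names and values are made up):

```python
def pick_new_master(slave_trx_ids: dict) -> str:
    """Return the name of the slave with the highest global transaction ID."""
    if not slave_trx_ids:
        raise ValueError("no slaves to choose from")
    # The slave with the highest counter has replicated the most transactions.
    return max(slave_trx_ids, key=slave_trx_ids.get)


if __name__ == "__main__":
    replicated = {"slave1": 41, "slave2": 57, "slave3": 55}
    print(pick_new_master(replicated))
```

How to break ties between slaves with the same counter is left open here.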
However, there is also a client-side benefit to global transaction IDs. If you want to read-your-writes from a replication cluster, you usually query the master. You won’t go to the slaves, because you do not know if they have replicated your writes already. In case you need read-your-writes, set the master_on_write config setting in version 1.1. In version 1.2 we can offer more, if you want and need it. We can search for a slave that has replicated the global transaction ID of your write to reduce the read-your-write load on the master. The keyword here is consistency and the background posting is Consistency, cloud and the PHP mysqlnd replication plugin. However, consistency is not nearly as nice a motto as the catchy global transaction ID theme.
Of course, the day the MySQL Server has built-in Global Transaction IDs, we don’t need to do the injection any more. Meanwhile, we give it a try… a report from the hacks of the past two days. Feedback is most welcome.
Warning: this now becomes a posting for hackers, not users. If you are not after implementation details, stop reading. The big news is the theme, nothing else. If you don’t trust any software you have not developed yourself but you like the idea of a replication and load balancing plugin, continue reading.
First try: injection
… our first attempt at global transaction ID injection is straightforward. By default, injection is done only for queries that go to the master. By default, all PHP MySQL APIs use auto commit. In the most basic case we just inject SQL before the query from the user. Doing it first avoids hassle if the user’s statement returns a result set. Injecting before the user’s statement also means we increment regardless of the success of the user’s statement.
$mysqli->query("SELECT 1");
$mysqli->query("INSERT INTO test(id) VALUES (1)");
SELECT ->
slave ->
auto commit on ->
query(SELECT)
INSERT ->
master ->
auto commit on ->
query(INJECTED), query(INSERT);
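To make the ordering concrete, here is a small simulation of inject-before-query in auto commit mode. It is not the plugin's code: it uses Python with an in-memory SQLite database as a stand-in for the master, and the table name `trx`, the column `trx_id` and the injected statement are assumptions.

```python
import sqlite3

# Hypothetical user-provided statement that maintains the counter.
INJECTED_SQL = "UPDATE trx SET trx_id = trx_id + 1"


def open_master() -> sqlite3.Connection:
    # isolation_level=None puts sqlite3 into auto commit mode.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE trx (trx_id INTEGER NOT NULL)")
    conn.execute("INSERT INTO trx (trx_id) VALUES (0)")
    conn.execute("CREATE TABLE test (id INTEGER PRIMARY KEY)")
    return conn


def query_with_injection(conn: sqlite3.Connection, user_sql: str):
    # query(INJECTED) runs first, then the user's statement.
    conn.execute(INJECTED_SQL)
    return conn.execute(user_sql)


def current_trx_id(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT trx_id FROM trx").fetchone()[0]


if __name__ == "__main__":
    master = open_master()
    query_with_injection(master, "INSERT INTO test(id) VALUES (1)")
    try:
        # Duplicate key: the user's statement fails after the injection ran.
        query_with_injection(master, "INSERT INTO test(id) VALUES (1)")
    except sqlite3.IntegrityError:
        pass
    print(current_trx_id(master))
```

The failed second INSERT still bumps the counter, which is the "increment regardless of success" behaviour of this first approach.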
Optionally, we allow doing the injection on slaves as well. It can be configured if errors caused by injected SQL are ignored or reported, e.g. if the global transaction ID sequence table is unavailable.
If not in auto commit mode, we do the injection when the user invokes the user API’s commit() function. This is possible as of PHP 5.4. We do not monitor all statements to catch query(COMMIT) calls. The same constraints apply as for 1.1’s trx_stickiness config setting.
Andrey, the king of mysqlnd, proposed to consider query(INSERT), ..., query(INJECTED). In this case we would not increment the global transaction ID if the user’s INSERT fails. However, it’s something for the king himself to evaluate. In other words: it’s beyond my skill level to do within hours. I’m somewhat sceptical it’s worth the effort.
We also started looking into using multi statements. In this case, we prepend the SQL that maintains the global transaction ID to the user’s statement and run the result as a multi statement. Shown is a prefixing example. It’s implemented as a hack for buffered non-prepared statements. We need to benchmark whether it’s worth the complicated logic over the initial approach.
$mysqli->query("INSERT INTO test(id) VALUES (1)");
INSERT ->
master ->
set_server_option(MULTI_STATEMENT_ON) ->
query(INJECTED; INSERT) ->
more_results ->
next_results ->
store_results ->
set_server_option(MULTI_STATEMENT_OFF)
First try: service level
In the area of "consistency" and service level, our first approach looks promising. The time Andrey invested into the 1.1 release to implement the filter logic starts to pay off.
In short, filters mean that we have a sequence of independent tools to find a node for running a statement. It is a bit like Unix command line tools that are connected on the command line with a pipe.
query(SELECT) ->
all masters, all slaves ->
filter(LOADBALANCING) ->
certain slave
We now have a new quality-of-service or "consistency" (qos) filter. For background information on the idea, see Consistency, cloud and the PHP mysqlnd replication plugin.
If, for example, the quality-of-service (consistency level) you need from the cluster is read-your-writes, you can set it in the plugin configuration file and create a filter chain like this:
query(SELECT) ->
all masters, all slaves ->
filter(QOS, STRONG_CONSISTENCY) ->
all masters, no slaves ->
filter(LOADBALANCING) ->
certain master
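The pipe analogy can be sketched in a few lines of Python. This is an illustration of the idea, not the plugin's C implementation: the `Node` type, the function names and the random pick inside the load balancing step are all assumptions.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    role: str  # "master" or "slave"


def qos_strong_consistency(nodes):
    # Strong consistency (read-your-writes): only masters remain.
    return [node for node in nodes if node.role == "master"]


def load_balancing(nodes):
    # For a SELECT, prefer slaves if the previous filters left any.
    slaves = [node for node in nodes if node.role == "slave"]
    return [random.choice(slaves or nodes)]


def run_chain(nodes, filters):
    """Pass the node list through each filter in turn, like a Unix pipe."""
    for node_filter in filters:
        nodes = node_filter(nodes)
    return nodes


if __name__ == "__main__":
    cluster = [Node("master1", "master"), Node("slave1", "slave"), Node("slave2", "slave")]
    print(run_chain(cluster, [load_balancing]))
    print(run_chain(cluster, [qos_strong_consistency, load_balancing]))
```

Each filter only sees the output of the previous one, which is what makes it cheap to add or remove a filter from the chain.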
This is not much of a win over the already existing master_on_write configuration setting. However, together with the new filter, we also introduced a new API call to change the filter chain at runtime. You don’t have to configure read-your-writes (master_on_write) when setting up the plugin; you can set it at runtime, on demand.
Let a filter chain like this be given:
query(SELECT) ->
all masters, all slaves ->
filter(LOADBALANCING) ->
certain slave
Then, when you place an order in your shop at runtime and need read-your-writes for a short period, you do:
mysqlnd_ms_set_qos(MYSQLND_MS_STRONG_CONSISTENCY);
$mysqli->query("SELECT id FROM orders");
/* ... do more queries that must not return stale data ... */
This will change the filter chain accordingly on-the-fly.
query(SELECT) ->
all masters, all slaves ->
filter(QOS, STRONG_CONSISTENCY) ->
all masters, no slaves ->
filter(LOADBALANCING) ->
certain master
Once you are done with the consistent reads, you can go back with one API call to eventual consistency (use masters and slaves, which may or may not serve current data).
That can save you a good number of SQL hints required in 1.1, if not using master_on_write.
… and back to the beginning
With the new filter and the new API call, we can also allow things like this:
$mysqli->query("INSERT INTO orders(...)");
$global_trx_id = mysqlnd_ms_get_global_trx_id($mysqli);
mysqlnd_ms_set_qos(MYSQLND_MS_SESSION_CONSISTENCY, $global_trx_id);
$mysqli->query("SELECT id FROM orders");
/* ... do more queries that must not return stale data for table orders... */
In this case we can read from any master and any slave which has replicated a certain global transaction ID. This is where we get back to the beginning. And, we open up for the future.
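The node selection behind this can be sketched as a simple filter. Again this is a Python illustration with assumed names and data, not plugin code: it presumes we already know, for every node, the last global transaction ID it has replicated.

```python
def nodes_for_session_consistency(nodes: dict, required_trx_id: int) -> list:
    """Return the names of nodes that may serve the read.

    `nodes` maps a node name to a (role, last_replicated_trx_id) tuple.
    """
    eligible = []
    for name, (role, trx_id) in nodes.items():
        # Masters always qualify; slaves only once they have caught up.
        if role == "master" or trx_id >= required_trx_id:
            eligible.append(name)
    return eligible


if __name__ == "__main__":
    cluster = {
        "master1": ("master", 12),
        "slave1": ("slave", 12),
        "slave2": ("slave", 9),
    }
    # The write we want to read back got global transaction ID 12.
    print(nodes_for_session_consistency(cluster, 12))
```

A load balancing filter would then pick one node from the returned list.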
Imagine, with a distant release (not 1.2!), you could ask for data that is no older than 2 seconds. The plugin would either read from the master, or a slave lagging no more than 2 seconds, or fetch the result from a local TTL cache, such as PECL/mysqlnd_qc, with a TTL of 2 seconds…
mysqlnd_ms_set_qos(MYSQLND_MS_EVENTUAL_CONSISTENCY, MAX_LAG, 2);
$mysqli->query("SELECT id FROM news");
Happy hacking to all of us…
Posted by Megan Singley
You know that corner sandwich shop that you love so much because they'll make you the exact sandwich you crave? Yea, so this is kinda like that... well, but not. Here at SEOmoz we think a lot about how we can not only come out with new features, but also improve existing tools so our members can get the most value out of PRO. As a part of the help team, I can’t stress enough how much we crave feedback from our users, so we thought we’d take this opportunity to get your two cents. Please take a moment to fill out the survey below and help make our Rank Tracker that perfect sandwich.
It has been a wonderful 5 weeks over at the ideasproject.com N9 Challenge, with a great community of active contributors. The result is that we now have over 2,500 ideas stored away on every UX topic from hardware to music players.
Throughout the 5 weeks ideas were evaluated as they came in - resulting in 5 lucky contributors getting a Nokia N9. An elite jury of designers will now review the ideas and award Nokia N9s for the best 10 ideas.
As for the rest, the IdeasProject will maintain the site for the foreseeable future so people interested in enhancing the Nokia N9 UI, building improved applications, or creating a new UI can view and implement the ideas listed. And, as mentioned before, the final jury contains several design leaders from Nokia, so I would not be surprised if some of the best appear in a Nokia device in the next few years.
To all those who contributed, commented, and otherwise participated we would like to say Thanks!
Hi, I’m Dan and I’m the Product Manager for Nokia Maps on Windows Phone and MeeGo. I’m pleased to announce that Nokia Maps for Windows Phone 7.5 is now available in the Marketplace on your Nokia Lumia 800 and 710, bringing the signature Nokia Maps experience to your device. Whether you are in the car or on foot, together with Nokia Drive, Nokia Maps on Nokia Lumia devices makes sure you have the world right in your pocket, with fast maps and millions of places that make anywhere feel local.
The fun thing about working in technology is that sometimes you are lucky enough to build products you can use every day. And Nokia Maps is my favourite example so far. We ask the question ‘where should we go?’ all the time – even when we are at home. To help answer that question we built the Nokia Maps experience for Lumia, combining the great Symbian assets our fans were used to with the beautiful Panorama experience on Windows Phone. And the result is an experience we think you are going to love when you are out and not sure where to go next or how to get there.
With Nokia Maps you can find the best nearby places, then check out reviews and see photos and user comments to hear about other people’s experiences there. It’s all about seeing the big picture, getting a real feel for a place and making informed choices. Once you decide on your destination you can use smart routing that includes walking, driving, and public transport to find out how to reach your destination and how long it will take.
Of course, places are more than just points on a map, so we’ve brought back the human touch with Nokia Maps. Connect easily with your destination, check out the one touch access to phone numbers, email addresses and websites. Plus you can share places with your friends via email, SMS, Facebook, Twitter, LinkedIn or Windows Live.
Some of you had a chance to check out maps as part of demos since we launched the phone. And it’s great to see you’re loving it! Wp7lab team says that “With information about public transport, driving, local points of interest and even dinning locations, the Nokia Maps is a great companion for those that like to travel and explore new things.”
Mobiletechworld was of the opinion that “Nokia Maps will be a really awesome addition to Windows Phone”
Clinton Jeff wrote that “The motto of Nokia Maps on Windows Phone is “feel like a local everywhere” so if you’re traveling around, here’s a great app for ya. I’m a huge fan of Nokia Maps, as you might already know, so I’m excited to see it released as an alternative to the *ugh* Bing Maps app.”
Thomas Ricker said: “It starts with a “beta” splash screen before zeroing in on your location. But for many, the fact that Nokia Maps is available for download is reason to celebrate….. Maps provides local walking, driving and public transport directions while highlighting cafes and other local points of interests over 3G or Wi-Fi networks. Of course, there’s a social element to Maps allowing you to review and share your location as well. You can also pin your favorite places to the Windows Phone home screen for quick access in the future. “
On twitter, we have @slodge: “Wow! Nokia Maps on Wp7 is superb – it gives me complete directions including train and tube times to get into london – wow – stunning!”
@mega_me tested it in New Zealand and says “Nokia maps on #wp7 really works in New Zealand, found everything from: Airport, Sushi, School, Foodtown, there’s “Places”/local scout… “
@rikindshah thinks “Nokia maps on WP7 best maps around any other. Blew away with its accuracy and features. Superb.”
@CristianCson feels “Super nice with Commute in the directions in #Nokia #Maps #wp7”
Finally, @prjkthack sums it up with “Wow. Nokia Maps is actually very nice. #wp7”
I hope you will all join us in exploring the world and finding the best places, wherever you go.
Download Nokia Maps on your Nokia Windows Phone – go to the marketplace through your Lumia and search for Nokia Maps. Take your maps for a spin, and let us know how you liked it.
Nearly every time we talk about our infrastructure, people ask us why we own and operate our servers rather than host Stack Overflow and the Stack Exchange network in the cloud. Usually when people ask us this, they seem to want to convince us that we should be in the cloud. The debate usually then centers around cost.
Cloud vs Self Hosting Cost?
The hypothetical cost of Stack Exchange being in the cloud has come up on meta. It turns out that the cost is difficult to actually figure out. Some of the things you need to take into account are:
- More or fewer Sysadmins required? (People say with the cloud you need fewer system administrators, never been convinced of this though)
- Licensing Costs
- Owned vs Rented Assets
- How many cloud “servers” or instances you would need vs real hardware
- Cost differences when you consider high availability
To get this analysis right you have to invest a lot of time, and even then it will only be an estimate. We have looked at cloud computing costs and we think they would actually be higher. When it comes down to it, though, the cost debate misses the point.
We Love Computers
and every aspect about them. We don’t just love programming and our web applications. We get excited learning about computer hardware, operating systems, history, computer games, and new innovations. Loving computers is an essential part of our company culture. Many of us have assembled our own workstations and our CTO even blogs about it in seven articles when he does. Most of us have grown up with computers as part of our identity. We all have a shared nostalgia for our first computers; if we haven’t taken our pilgrimage to The Computer History Museum yet then we dream about it. We like to think about the past, present, and future of computing. Owning and operating our own servers is part of how we get to live out our love of computers.
This culture means when we hire technical staff, we hire people who share this passion. I believe that this passion translates into a better product. Whenever someone does a cost analysis of cloud vs self hosting there is no row in the spreadsheet for “Work Productivity Increase due to Passion.” We are performance and control freaks and love to tweak everything including our hardware. If we outsourced our hosting to cloud computing, we would be outsourcing part of our passion. If you just want to use someone else’s computers, it means you don’t love computers, or at least not every aspect of them. Sometimes cloud computing may be the best fit (for example if you have 20x the traffic around the holidays or tax season), but if you truly love computing, giving up control of computers to someone else will hurt.
We don’t just like computers, we love them. We have an emotional connection to them, and suggesting that we let someone else own, manage, and tweak them is like suggesting we get rid of what we love — just the thought of it offends.