Google Plays Nice: XML sitemaps, images, and a mystery
Ian Lurie Jul 20 2010
Last month, Google announced that they’re now accepting mixed-media XML sitemaps: You can put images, video and regular page URLs all into the same map.
I saw this and rubbed my hands together while cackling maniacally. I could finally make sure that valuable images that clients paid for, or paid to have taken, would get indexed!
I won’t go into the details of Google’s image, video and page sitemap spec. You can read it all in their post. But my fumbling about in code led to an interesting discovery: Google’s not all that picky.
I wrote my own little Python crawler over the weekend.
Yes, I know how pathetic that sounds.
Anyway, I wrote a Python crawler. It goes out to a site, grabs the URLs of all pages and all images, and puts ’em all into an XML sitemap. Neato! My first real use of Python. Alas, I did it horribly wrong, and generated a sitemap that munges images and page URLs together as if they were all the same. That is, I put both images and page URLs between <url> and </url> tags. Turns out, that’s wrong.
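For the curious, the "grab all the page and image URLs" step can be sketched with nothing but Python's standard library. This is a minimal illustration, not the crawler described above: the class name LinkAndImageCollector, the collect() helper and the sample HTML are all made up for the example, and it parses a single HTML string rather than fetching a live site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndImageCollector(HTMLParser):
    """Collects absolute page and image URLs from one HTML document.

    Hypothetical helper for illustration; not the original crawler.
    """

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pages = []   # URLs found in <a href="...">
        self.images = []  # URLs found in <img src="...">

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # Resolve relative links against the page's own URL.
            self.pages.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.base_url, attrs["src"]))


def collect(base_url, html):
    """Return (page_urls, image_urls) found in the given HTML string."""
    parser = LinkAndImageCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.pages, parser.images


# Made-up sample markup standing in for a fetched page.
SAMPLE_HTML = '<a href="/foo.html">Foo</a> <img src="/image.jpg" alt="An image">'
pages, images = collect("http://www.example.com/", SAMPLE_HTML)
```

Keeping pages and images in separate lists is the important bit. Dump both into one list and you end up with exactly the munged sitemap described above.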
According to Google’s post, you have to use all sorts of fancy code to insert images and video into an otherwise normal XML sitemap:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.sitemaps.org/schemas/sitemap-image/1.1"
        xmlns:video="http://www.sitemaps.org/schemas/sitemap-video/1.1">
  <url>
    <loc>http://www.example.com/foo.html</loc>
    <image:image>
      <image:loc>http://example.com/image.jpg</image:loc>
    </image:image>
    <video:video>
      <video:content_loc>http://www.example.com/videoABC.flv</video:content_loc>
      <video:title>Grilling tofu for summer</video:title>
    </video:video>
  </url>
</urlset>
Eesh. It’s actually a good thing I didn’t see that before I started. I would’ve thrown my hands up and never learned Python.
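It looks scarier than it is, though. Here is a rough sketch of generating the nested format for pages and images with Python's built-in xml.etree.ElementTree. The namespace URIs are copied straight from the example above, so check them against Google's current documentation before relying on them. The build_sitemap() function and its input dictionary are assumptions made up for this example, and video is left out to keep it short.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as they appear in the example above (not independently verified).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMAGE_NS = "http://www.sitemaps.org/schemas/sitemap-image/1.1"

# Serialize the sitemap namespace as the default and the image one as "image:".
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("image", IMAGE_NS)


def build_sitemap(pages):
    """Build sitemap XML from a dict of page URL -> list of image URLs.

    Hypothetical helper: each image is nested inside the <url> element
    of the page it appears on, instead of getting its own <url> entry.
    """
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page_url, image_urls in pages.items():
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page_url
        for image_url in image_urls:
            image = ET.SubElement(url, f"{{{IMAGE_NS}}}image")
            ET.SubElement(image, f"{{{IMAGE_NS}}}loc").text = image_url
    return ET.tostring(urlset, encoding="unicode")


sitemap_xml = build_sitemap({
    "http://www.example.com/foo.html": ["http://example.com/image.jpg"],
})
print(sitemap_xml)
```

The only structural difference from the flat version is that each image hangs off the page it lives on, rather than sitting in a <url> element of its own.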
But here’s the thing: I submitted the incorrectly formatted sitemap before I knew I’d done it wrong. Once I saw the error of my ways, I got ready to watch Google wallop my company site, or at least ignore all of the images. But they didn’t. Instead, Google is indexing the images I included in the sitemap, even though the sitemap’s wrong.
Before I generated the sitemap, a site:portentinteractive.com search in Google Images showed about 100 images. The day after, it showed 180 images. Today, it shows 193 images.
Clearly, Google’s tolerant of dorky semi-competent programmers like me. But the real question, and I’m honestly curious if anyone out there knows, is: Do we have to use the fancy formatting, or is packing images into URL elements going to work for the long term?
Comment below if you have a theory, or if you happen to be a Google engineer.
Ian Lurie
CEO
Ian Lurie is CEO and founder of Portent Inc. He's recorded training for Lynda.com, writes regularly for the Portent Blog and has been published on AllThingsD, Forbes.com and TechCrunch. Ian speaks at conferences around the world, including SearchLove, MozCon, SIC and ad:Tech. Follow him on Twitter at portentint. He also just published a book about strategy for services businesses: One Trick Ponies Get Shot, available on Kindle.