Hi, my name is Maile Ohye, and I work at Google as a Developer Programs tech lead. I’m so glad to be speaking to you today because, for me and on behalf of all my colleagues at Google, we understand how important it is to have a strong news ecosystem, so I hope you find something useful in this presentation. Today we’re going to talk about three main topics.
First the ranking factors of Google News search. Next we’re going to cover some of the frequently asked questions that we hear from publishers or from SEOs. And last we’re going to talk more about the best practices when you publish articles. So let’s take a first look at how your articles appear in a Google search result.
There are several ways. First is obviously on google.com, where people might see a news onebox. And this upper screenshot shows you a news result for a search like Obama medals, where the user is shown some news articles.
There’s another way your articles can appear, and that’s in Google News. This second screenshot is from a user going directly to news.google.com, and here they see a similar cluster of articles, but instead of the google.com homepage they’re seeing it on the News homepage.
So you might be asking yourself, “How did these articles appear?” Now the way we gather these articles is by first crawling them, next grouping them, and then ranking all of the information. And we’ll cover each of these steps more in depth. Let’s start with crawling. In the crawling stage, much like web search, we have Googlebot, which goes out to your news sites to look for new articles. And there are two ways that we retrieve these articles.
One is through our discovery crawl, where Google sees new URLs and then crawls those articles, but in addition to that discovery crawl you can also create News Sitemaps. News Sitemaps are a way for you to list exactly what your new URLs are, so we can use that in addition to our discovery crawl to find your new information.
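As a rough illustration of what a News Sitemap contains, here is a minimal Python sketch that builds one. The element names and namespaces follow the published News Sitemap schema as best I understand it, and the `keywords` field is the one this talk mentions; check the current Google News publisher documentation before relying on any of it. The function name, the dictionary keys, and the example URL are hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
NEWS_NS = "http://www.google.com/schemas/sitemap-news/0.9"


def build_news_sitemap(articles, publication_name, language="en"):
    """Return a News Sitemap as an XML string.

    `articles` is a list of dicts with hypothetical keys: "url", "title",
    "published" (a W3C-format date string) and, optionally, "keywords".
    """
    # Serialize the sitemap namespace as the default and the news one as "news:".
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("news", NEWS_NS)

    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for article in articles:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = article["url"]

        news = ET.SubElement(url, f"{{{NEWS_NS}}}news")
        publication = ET.SubElement(news, f"{{{NEWS_NS}}}publication")
        ET.SubElement(publication, f"{{{NEWS_NS}}}name").text = publication_name
        ET.SubElement(publication, f"{{{NEWS_NS}}}language").text = language

        # Meta-information the publisher supplies instead of relying on extraction.
        ET.SubElement(news, f"{{{NEWS_NS}}}publication_date").text = article["published"]
        ET.SubElement(news, f"{{{NEWS_NS}}}title").text = article["title"]
        if article.get("keywords"):
            ET.SubElement(news, f"{{{NEWS_NS}}}keywords").text = ", ".join(article["keywords"])

    return ET.tostring(urlset, encoding="unicode")
```

Calling `build_news_sitemap([...], "Example Times")` with one dictionary per new article gives you a string you could write to a file and submit; only recently published URLs belong in it.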
And of course, we respect the Robots Exclusion Protocol, so you can create a robots.txt file or use HTTP headers to let us know specifically what documents you want crawled and what documents you want excluded from Google search results. Last, once we’ve made sure that we’ve only crawled what we are allowed to crawl, we bring those articles back to Google. And that’s the end of the crawling phase.
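If you want to sanity-check a robots.txt file before publishing it, a small sketch like the following can help. It uses Python’s standard `urllib.robotparser`, which is not Googlebot’s own parser and may differ from it in edge cases. The robots.txt contents, the `/press-releases/` directory, and the helper name are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep press releases away from the news crawler,
# and leave everything crawlable for all other user agents.
ROBOTS_TXT = """\
User-agent: Googlebot-News
Disallow: /press-releases/

User-agent: *
Disallow:
"""


def is_crawl_allowed(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

For example, `is_crawl_allowed(ROBOTS_TXT, "Googlebot-News", "https://example.com/press-releases/q3.html")` evaluates the press-release rule, while the same call with a URL under `/news/` falls through to the default of allowing the crawl.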
So next we get into that grouping phase, and here’s where we have this classification idea. In classification, what we’re doing is actually looking at each individual article’s contents. So you can see on this article, “The millions Kozlowski didn’t steal”. We actually take out individual words like “business”, “tyco”, “money” and “cfo” and understand that this article pertains to the Business section.
And that’s how we populate those different sections in Google News like Business, Health and Entertainment. Another thing we’re doing is populating our editions, whether it’s UK, US or India. And we can take that from the text as well. Here we’ve taken words like New York and Manhattan, and that’s led us to believe that this article pertains to the United States.
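To picture the classification idea, here is a deliberately simple toy in Python. It is not how Google News classifies articles; it just counts whole-word keyword hits and picks the best-scoring label. The keyword lists and the function name are made up for illustration, apart from the example words quoted in this talk.

```python
import re

# Hypothetical keyword lists; only "business", "tyco", "money", "cfo",
# "new york" and "manhattan" come from the examples in the talk.
SECTION_KEYWORDS = {
    "Business": {"business", "money", "cfo", "tyco"},
    "Health": {"hospital", "vaccine", "doctor"},
    "Entertainment": {"film", "album", "celebrity"},
}
EDITION_KEYWORDS = {
    "US": {"new york", "manhattan"},
    "UK": {"london", "manchester"},
    "India": {"mumbai", "delhi"},
}


def best_label(text, keyword_map):
    """Return the label whose keywords occur most often in `text`, or None."""
    lowered = text.lower()
    scores = {}
    for label, words in keyword_map.items():
        # Count whole-word (or whole-phrase) matches only.
        scores[label] = sum(
            len(re.findall(r"\b" + re.escape(word) + r"\b", lowered))
            for word in words
        )
    label, score = max(scores.items(), key=lambda item: item[1])
    return label if score > 0 else None
```

Running `best_label(article_text, SECTION_KEYWORDS)` and `best_label(article_text, EDITION_KEYWORDS)` on the same text gives one guess for the section and one for the edition, which mirrors the two groupings described here.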
So this is that grouping stage where we understand what an article is about and also what sections and editions it pertains to. So now that we’ve covered crawling and grouping, we have ranking. And ranking is going to come in two phases.
First of course is story ranking. Story ranking is much like what you see on the Google News page where there’s a group of stories, whether it might be Obama and the medal ceremony, or it might be the death of Michael Jackson. Or it might be rising oil prices.
Story ranking is deciding which of these stories should be placed highest, which second, which third. That type of idea. These clusters of stories. And we rank these story clusters according to aggregate editorial interest. So let’s take a deeper look at what that means. In the upper diagram you can see that a smaller story has a small effect on publishing activity.
Let’s say in North Carolina a man was giving out free cars to those who really needed them. That’s a great human interest story. It might be covered in a local paper and also be picked up by a few wires. But this is still a relatively small story, not showing as much aggregate editorial interest as, say, a larger story like the death of Michael Jackson, which is not only published in a local newspaper, but in foreign and national papers, covered by many wires, and also includes op-ed articles and follow-up articles.
You can see that due to all the editorial interest in this story, we will likely rank it higher than the human interest story about a man giving out free cars in North Carolina. So that’s story ranking. We’re actually ranking those clusters. The next part about ranking is the individual article ranking.
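One way to picture story ranking by aggregate editorial interest is the toy Python sketch below. It assumes, purely for illustration, that interest can be approximated by the number of distinct sources covering a story; the talk does not give a formula, and the function name and story identifiers are hypothetical.

```python
from collections import defaultdict


def rank_story_clusters(articles):
    """Order story clusters by how many distinct sources covered each one.

    `articles` is an iterable of (story_id, source) pairs. Distinct-source
    count is a stand-in for "aggregate editorial interest".
    """
    sources_by_story = defaultdict(set)
    for story_id, source in articles:
        sources_by_story[story_id].add(source)
    return sorted(
        sources_by_story,
        key=lambda story: len(sources_by_story[story]),
        reverse=True,
    )
```

With a handful of pairs for a widely covered story and one or two for a local story, the widely covered cluster comes out first, which matches the two diagrams described above.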
Article ranking helps us take a cluster of stories, say the death of Michael Jackson, and helps us determine out of those 200 stories which one should be ranked first for our users, which should be ranked second and so on. There are many signals that go into article ranking, but I’m just going to cover four of the major ones for you here.
First is fresh and new. It’s important to us that an article contain recent substantial information about a topic. And it needs to be objective news to lead this cluster of stories. So press releases, satire, op-eds aren’t eligible to lead clusters.
Another factor is duplication and novelty detection. And that’s where we try to determine an original source of content from those that are duplicating the information. So something that we use there is this idea of citation rank.
So for an article, we can see that if a news story was broken by the Los Angeles Times, and then later another article, say in Washington, cited the Los Angeles Times as being the source of the information, then we can start to see the citation rank taking place for this story.
This article from the Los Angeles Times might have higher ranking now because other people are citing it as being an original story. Another factor is local and personal relevancy. And this applies to individual sections, as well as editions of your publication. So what we want to do is actually give more weight to local sources that are likely more relevant to the news item.
So if we take that idea of a man giving out free cars in North Carolina, it’s likely that we would take a paper like the Charlotte Observer, and know that could be a higher authority for that story and therefore that article might be ranked higher in this cluster. The last signal I wanted to cover in article ranking is the idea of trusted sources. For us trusted sources doesn’t have to do with some arbitrary decision that we make, but it’s actually data driven.
So according to our data over time, did users start to look at your articles and then click on them? Let’s say that there were five articles being listed and a significant number of users chose the third article and went to that source. Then we might start to determine that this source is actually very trusted for this certain type of information, and over time we start to build out what publications are trusted sources.
But not for their entire publication, it’s done on a section and category basis. So something like the Sporting News could be very trusted for sports information but maybe not so much for business. And likely something like the Wall Street Journal might be very trusted in the United States for business information but maybe not in India.
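To make the per-section, per-edition idea concrete, here is a small Python sketch that computes a click share for each (source, section, edition) combination. The click-share measure, the function name, and the data layout are assumptions for illustration; the talk only says that trust is derived from user behaviour data over time.

```python
from collections import Counter


def click_share(impressions, clicks):
    """Return clicks divided by impressions for each (source, section, edition).

    Both arguments are iterables of (source, section, edition) tuples, one
    tuple per time an article was shown or clicked.
    """
    shown = Counter(impressions)
    clicked = Counter(clicks)
    return {key: clicked[key] / shown[key] for key in shown}
```

Because the key includes the section and the edition, the same publication can end up with a high share for sports and a low one for business, which is the distinction drawn here.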
So again, these trusted sources have to do with section and edition. So it’s a very specific thing that we’re looking for, based on aggregate user behaviour. So those are just four of the signals that we use in News Search article ranking. Next let’s go into some of your frequently asked questions.
You might be asking “What are the benefits of submitting a News Sitemap?” Well, we think that Sitemaps are beneficial to us and to you as a publisher as well. First of all they provide you greater control over which of your articles appear in Google News.
And that’s because, as I mentioned earlier, they help complement our discovery crawl and tell us exactly what articles are new and which articles we should crawl. Second, News Sitemaps are great because they help you give us meta-information about your articles. So rather than rely on our extractor you can give us the publication date.
And rather than rely on just our extractor to determine the categories for your article you can give us good hints by using the keywords field. So all in all, we think News Sitemaps provide a huge benefit to publishers. Another frequently asked question is “Can Googlebot visit our URLs more than once?”
And the answer is yes, we can definitely recrawl URLs to check for updates. But just taking a step back. Initially Google can actually find your new content within a matter of minutes of when you published it.
And we find your new content through our discovery crawl or through News Sitemaps, and after that initial discovery we will definitely go back and re-check for new article content. The time at which we may re-crawl varies, but it’s pretty safe to say that we’ll probably go back and check for new content within 12 hours.
So we’ll find it within a matter of minutes and we’ll re-crawl for new content within 12 hours. You might also be asking “How do I optimize my multimedia content?” Well that’s a great question. So we’re going to take a look at two types of content. First, let’s talk about videos. With videos you can create a YouTube channel and submit that to us.
We’re looking to include other types of video hosters, but right now with YouTube we have a pretty good idea of the user experience, that the video will load etc., so YouTube is a trusted video hosting platform for us.
And if you do use YouTube, remember that including textual descriptions and transcripts is also helpful because that helps us associate a specific video with the subject matter. Now let’s talk about images.
With images we have five tips that will help your images get included in Google News Search. First you want to use a large size image with a good aspect ratio. Second you want descriptive captions and alt text. Third you want to keep your good image near the title.
And that again helps us associate an image with the subject matter. Fourth, you want your good image to be inline and not a clickable version. So again you want your good image near the title and inline.
And last we prefer JPEG. So if you use things like PNG images, that’s not as good for Google News as a JPEG. So I would definitely stick to JPEG if you would like your images included in Google News. So the last frequently asked question is of course “What about PageRank?”
PageRank is a lesser factor in Google News than it is in web search. And that makes sense, right, because the linking structure for an article that was only published minutes ago isn’t going to be the same as one that was published months or years ago. So we have to use PageRank delicately in Google News.
So instead of using signals like PageRank, we actually rely more on the signals we talked about earlier. Things like timeliness. Is it fresh and new? Or does it have local or personal relevancy?
Those types of things. So now that we’ve covered how Google crawls, groups and ranks articles, and we’ve answered some of your frequently asked questions, let’s just get into some best practices.
First, it’s important that you create permanent, unique URLs with at least 3 digits. And the reason for this is that traditionally, news publishers have used an article ID, and then equals a number, in their URL strings. And that has helped us to determine that it’s an article and not just a static HTML page.
But if your news publishing system doesn’t include digits, at least three for Google News, then you can actually submit a News Sitemap. So that’s the workaround. If you don’t have 3 digits in your URLs, you can create a News Sitemap and let us know which specific URLs belong in News.
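A quick way to audit your own URLs against this guideline is a check like the Python sketch below. It assumes that “at least 3 digits” means a run of three or more consecutive digits somewhere in the path or query string, which is my reading of the guideline rather than a documented rule, and the function name and example URLs are hypothetical.

```python
import re
from urllib.parse import urlparse


def has_three_digit_id(url):
    """Return True if the URL path or query contains a run of 3+ digits."""
    parts = urlparse(url)
    # Ignore the scheme and host so digits in a domain or port do not count.
    return re.search(r"\d{3,}", parts.path + "?" + parts.query) is not None
```

URLs for which this returns `False` are the ones you would list in a News Sitemap as the workaround.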
The second best practice is to not break up the article body. So your news article should have sequential paragraphs that can all be included in Google News. You don’t want to break that up with user comments or links to related posts, or even things like links to additional pages.
That’s not as good for Google News. We’ll take all the article on that first page. So look again to not break up the article body. A third best practice is to put dates between the title and the body and that will help our date extractor to have the correct publication date. Fourth, titles matter. And this is to have a good HTML title as well as an article title.
So you want your title to be extremely indicative of the story at hand. Fifth, it’s best for Google News if you separate your original article content from your press releases. And you can do this in a directory structure. And this helps us determine what is specifically a news article versus what might be satire or opinion or a press release.
And the last tip of course is to create unique and informative content. And that’s always going to help you do well in the rankings. So the more unique content that you create, and the more users that enjoy that, the more users we’ll send there, and this is kind of the converse of the idea of just publishing other people’s content or just having duplicate information.
So again, the more great information that you put out for all of us to read, the more users you’ll attract to your site. If you have additional questions, please feel free to visit our News Publisher Help Center, and thanks so much for reading.