Many of us use RSS feeds to get updates from our favorite news sites or blogs; however, RSS is only available on a very small portion of the Internet. Michael Tung, CEO of Diffbot, ran into this problem as a student at Stanford, when he was trying to stay on top of class web pages. Since the sites didn’t have RSS, he developed an algorithm that found updates by looking at the visual layout of the page, as opposed to the HTML. After two years of refining the program, Diffbot has released an API that has allowed developers to accomplish some fascinating things with this technology.
“We provide technology that allows applications to interpret web pages like a human being,” explains Tung. “We’ve discovered that the entire Internet can be classified into about 30 different page types. What that means is that even though there’s essentially an infinite number of web pages on the web, there are certain common layouts and ways humans structure web pages that are understandable.”
Rather than looking at the tags or markup within a page, Diffbot looks at things like the X and Y coordinates of different parts of a page, the amount of screen real estate that each part is given, how a given part is positioned relative to everything else on the page, and which fonts and borders are used. Diffbot is developing an API for each type of page and currently has APIs available for both front pages and article pages.
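To make the idea of layout-based signals concrete, here is a minimal sketch in Python. It is not Diffbot's algorithm, which the company has not published in this article. The `Block` class, the feature names, and the "largest area wins" heuristic are all assumptions made up for illustration; they only show how a region of a page can be described by its position, size, and font without reading any markup.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A rendered region of a page, described only by visual properties (hypothetical)."""
    name: str
    x: float          # left edge, in pixels
    y: float          # top edge, in pixels
    width: float
    height: float
    font_size: float


def visual_features(block, page_width, page_height):
    """Return layout features for a block that ignore the underlying markup."""
    return {
        # Share of the page's screen real estate given to this block.
        "area_share": (block.width * block.height) / (page_width * page_height),
        # Position of the block's center relative to the whole page (0.0 to 1.0).
        "center_x": (block.x + block.width / 2) / page_width,
        "center_y": (block.y + block.height / 2) / page_height,
        "font_size": block.font_size,
    }


def largest_block(blocks, page_width, page_height):
    """Toy heuristic (an assumption): treat the block with the most area as the main content."""
    return max(
        blocks,
        key=lambda b: visual_features(b, page_width, page_height)["area_share"],
    )


# A made-up 1000 x 1000 pixel page with three regions.
blocks = [
    Block("header", 0, 0, 1000, 100, 24),
    Block("sidebar", 0, 100, 200, 900, 12),
    Block("body", 200, 100, 800, 900, 16),
]
main = largest_block(blocks, 1000, 1000)
```

A real system would presumably combine many such signals rather than rely on area alone, but the sketch shows the kind of input that is involved.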
“Developers are always trying to use web data, for example,” says Tung. “That’s why geeks like us created things like RSS and ways of syndicating data, essentially creating this machine-readable layer of the Internet. But as we’ve seen, adoption of those semantic formats, Open Graph for example, has never really taken off because there’s a chicken and egg problem. What we hope to do is create artificial intelligence software that can automatically understand what’s the layout versus what’s the actual information and extract that automatically without humans having to create those annotations themselves.”
Hundreds of developers are already building apps on top of the available APIs. For example, one developer built a radio station powered by Diffbot. The developer’s father is blind, which can make it extremely difficult to use the web, even with a screen reader. So the developer created Hacker News Radio, which pulls in a Hacker News RSS feed, passes the articles through the Diffbot API to get the actual text, and then sends the text through text-to-speech engines so it can be heard.
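The Hacker News Radio pipeline described above (feed, then text extraction, then speech) can be sketched roughly as follows. This is not the developer's actual code. The function names are invented, and `extract_text` and `speak` are hypothetical callables standing in for the Diffbot article API and a text-to-speech engine, so the sketch makes no network calls.

```python
import xml.etree.ElementTree as ET


def feed_links(rss_xml):
    """Return the <link> of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    return [
        item.findtext("link")
        for item in root.iter("item")
        if item.findtext("link")
    ]


def read_feed_aloud(rss_xml, extract_text, speak):
    """Run each linked article through text extraction, then text-to-speech.

    extract_text(url) -> str and speak(text) are supplied by the caller
    (hypothetical stand-ins for an extraction API and a speech engine).
    Returns the number of articles that were spoken.
    """
    spoken = 0
    for link in feed_links(rss_xml):
        text = extract_text(link)
        if text:  # skip pages where no article text was found
            speak(text)
            spoken += 1
    return spoken


# A tiny made-up feed for demonstration.
SAMPLE_RSS = """<rss version="2.0"><channel><title>Example</title>
<item><title>A</title><link>https://example.com/a</link></item>
<item><title>B</title><link>https://example.com/b</link></item>
</channel></rss>"""

heard = []
count = read_feed_aloud(SAMPLE_RSS, lambda url: "text of " + url, heard.append)
```

Passing the extraction and speech steps in as functions keeps the pipeline independent of any particular service or speech engine.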
“It’s really, really simple,” explains Tung. “You just pass in a URL, you pass in your developer token and you get back a JSON that has all these fields in it. For the article API, those fields would be the title, who the author is…[and] the text.” Diffbot can even recognize if an article has multiple pages, so it’s sure to pick up the entirety of the text.
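As a rough illustration of the request Tung describes, the sketch below builds a URL carrying a page address and a developer token, then picks the title, author, and text out of a JSON response. The endpoint is a placeholder, not Diffbot's real address, and the query parameter names `token` and `url` and the JSON keys `title`, `author`, and `text` are assumptions based on his description; check Diffbot's own documentation for the actual values.

```python
import json
from urllib.parse import urlencode

# Placeholder endpoint (an assumption), NOT the real Diffbot address.
API_ENDPOINT = "https://api.example.com/article"


def build_request_url(page_url, token, endpoint=API_ENDPOINT):
    """Build a GET URL that passes the page URL and a developer token.

    The parameter names "token" and "url" are assumed for illustration.
    """
    return endpoint + "?" + urlencode({"token": token, "url": page_url})


def parse_article(body):
    """Pull the fields mentioned in the article out of a JSON response body.

    Fields missing from the response come back as None.
    """
    data = json.loads(body)
    return {field: data.get(field) for field in ("title", "author", "text")}


request_url = build_request_url("https://example.com/post", "TOKEN")

# A made-up response body, shaped like the one described in the quote.
sample_body = '{"title": "T", "author": "A", "text": "Body", "extra": 1}'
article = parse_article(sample_body)
```

In practice the built URL would be fetched with an HTTP client and the response body handed to `parse_article`.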
Diffbot web site: //www.diffbot.com/
Diffbot blog: //www.diffbot.com/blog
Diffbot profile on CrunchBase: //www.crunchbase.com/company/diffbot