How we do it
How we approach the challenge of turning unstructured global online news into structured, machine-readable data.
- To monitor all relevant news sites in the world
- To detect all news articles on the sites we monitor
- To extract the news data as accurately as possible
- To add supporting metadata to each article
- To deliver the news in real-time
We continuously benchmark ourselves against other global news providers to make sure that we maintain our position as market leaders. While no provider can guarantee 100 percent accuracy, we will always keep raising the bar in terms of coverage, quality and speed.
- Manual configuration of all new sites
- Automated crawling and data extraction
- Manual maintenance supported by automated quality assurance
1. Manual configuration
New sites are added either at our customers' request or as a result of our continuous benchmarking against other crawlers.
Our system automatically detects new article links and recognizes the headline, body text, author, and date of an article. However, since all websites are constructed differently, our configuration team will initially fine-tune the extractor to reduce the risk of incorrect or incomplete crawling.
Once a new website has been fine-tuned by our configurators, it is transferred to automated crawling and quality control.
2. Automated crawling and data extraction
Phase 1: Detect new articles
We crawl all websites repeatedly to look for links that point to new articles. Article links are most often found on the front page or a section page of a website, e.g., sports, business, or culture. Our system automatically matches the frequency of visits to a section with the frequency of new articles being published in that section. In other words, if a section is updated with new article links very often, we will keep visiting that section very often to look for new links.
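The adaptive visit frequency described above can be sketched as a simple feedback loop. This is an illustrative example, not Opoint's actual scheduler; the interval bounds and adjustment factors are assumptions chosen for clarity.

```python
# Illustrative sketch of an adaptive crawl scheduler: shorten the visit
# interval for a section when new links keep appearing, back off otherwise.

MIN_INTERVAL = 60      # seconds; never visit more often than once a minute
MAX_INTERVAL = 3600    # seconds; never wait more than an hour between visits

def next_interval(current_interval: float, new_links_found: int) -> float:
    """Return the waiting time before the next visit to a section."""
    if new_links_found > 0:
        updated = current_interval / 2    # busy section: visit more often
    else:
        updated = current_interval * 1.5  # quiet section: back off
    return max(MIN_INTERVAL, min(MAX_INTERVAL, updated))
```

A busy front page would converge toward the minimum interval, while a rarely updated archive section drifts toward the maximum, which matches the behavior described above.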
We strive to detect new articles as soon as they are published. In fact, we crawl each section so often that in 98 percent of our visits we find no new articles at all.
A publisher may post many different links to the same article in various places around the website, but our system is specifically designed to avoid crawling the same article more than once. This is an important measure to avoid false duplicates in the news feed.
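One common way to avoid crawling the same article twice is to reduce each link to a canonical form and keep a set of links already seen. The sketch below is purely illustrative; real canonicalization rules would differ per site.

```python
from urllib.parse import urlsplit, urlunsplit

seen_articles: set[str] = set()

def canonicalize(url: str) -> str:
    """Reduce a link to a canonical form so the same article counts once.
    (Illustrative rules: lower-case the host, drop query and fragment,
    strip a trailing slash.)"""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, "", ""))

def is_new_article(url: str) -> bool:
    """True the first time an article is seen, False on any repeat link."""
    key = canonicalize(url)
    if key in seen_articles:
        return False
    seen_articles.add(key)
    return True
```

With this scheme, a front-page link with a tracking query string and a section-page link with a trailing slash both map to the same key, so the article is extracted only once.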
Phase 2: Extracting new articles
As soon as a new article link has been detected, we will open the link and attempt to extract the text from the article. Our automated quality control will alert us if the extraction fails.
Text extraction is no easy task, as many articles are riddled with ads, promotions, irrelevant links and so on. Our extraction application is specifically designed to avoid such noise, but you should not expect an exact match between the article on the webpage and the extracted text. Furthermore, the publisher may edit the content of an article any number of times, whereas our version is a representation of the article at the moment we detected it.
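Separating body text from surrounding noise is often done with simple heuristics such as block length and link density. The following is a hypothetical illustration of that idea, not Opoint's extractor; the thresholds are assumptions.

```python
# Hypothetical noise filter: a block of text is likely article body if it
# is reasonably long and few of its words are wrapped in links (ads and
# navigation menus tend to be short and link-heavy).

def is_body_text(text: str, link_words: int, min_words: int = 15,
                 max_link_ratio: float = 0.3) -> bool:
    """Classify one text block as body text (True) or noise (False)."""
    words = len(text.split())
    if words < min_words:
        return False
    return link_words / words <= max_link_ratio

def extract_body(blocks: list[tuple[str, int]]) -> str:
    """blocks: (text, number of words inside links) for each page block."""
    return "\n\n".join(t for t, lw in blocks if is_body_text(t, lw))
```

A short "Read more" teaser or a list of related-article links fails both checks, while a normal paragraph passes, which is why the extracted text rarely matches the rendered page word for word.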
3. Manual maintenance supported by automated quality assurance
The automated quality control consists of various checkpoints and alerts that tell us whether we are detecting new articles as expected (crawling phase 1), and if those articles are extracted correctly (crawling phase 2).
If some sources are particularly important to you, we offer these additional manual services to further ensure the continuous delivery of news data.
We monitor the daily volume of articles to detect deviations. If we see an unusual deviation, we will review our crawling configuration to make sure that everything is working correctly.
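A common way to flag an unusual deviation in daily volume is to compare today's count against the recent average. The sketch below is an assumption-laden illustration (a fixed z-score threshold); real checks may be per-source and account for weekly or seasonal patterns.

```python
from statistics import mean, stdev

def volume_alert(history: list[int], today: int,
                 z_threshold: float = 3.0) -> bool:
    """Flag today's article count if it deviates more than z_threshold
    standard deviations from the recent daily average."""
    if len(history) < 2:
        return False  # not enough data to judge a deviation
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu  # history is flat: any change is a deviation
    return abs(today - mu) / sigma > z_threshold
```

A source that normally publishes about 100 articles a day and suddenly delivers 10 would trigger the alert and prompt a review of the crawling configuration.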
We screen the Google Index to determine whether Google has found articles that we have not. If that is the case, we will refine our crawling configuration to catch those articles.
In rare cases our crawler is blocked by CAPTCHA tests. We offer a manual CAPTCHA bypass service for customers who need all articles from blocked sources.
The price of this service is determined by the number of articles blocked by CAPTCHA.
When a source is added to the author detection list, we refine the crawling configuration to detect as many authors as possible.
We monitor the percentage of articles in which we detect an author. If the number of articles containing an author declines, we adjust the crawling configuration to re-establish author detection.
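A monitoring check like the one above can be expressed as a comparison between the current author-detection rate and a per-source baseline. This is a hypothetical sketch; the tolerance value is an assumption for illustration.

```python
def author_rate_declined(detected: int, total: int, baseline_rate: float,
                         tolerance: float = 0.10) -> bool:
    """Alert when the share of articles with a detected author drops more
    than `tolerance` (absolute) below the source's baseline rate."""
    if total == 0:
        return False  # no articles today, nothing to measure
    current = detected / total
    return baseline_rate - current > tolerance
```

If a source that normally yields authors in 80 percent of articles drops to 40 percent, the check fires and the crawling configuration is reviewed.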
Opoint can deliver content from behind paywalls if:
1. the customer has signed an agreement with the publisher, and
2. the customer provides Opoint with login credentials.
We monitor the functionality of the login credentials at all times. Content fetched using a particular login will be delivered exclusively to the customer that has provided the login.
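One crude way to monitor whether stored credentials still work is to check a fetched page for a paywall prompt. The marker string below is a made-up example; a real check would use per-publisher markers.

```python
def login_still_works(html: str, paywall_marker: str = "subscribe now") -> bool:
    """Illustrative credential check: if the fetched page still shows the
    paywall prompt, the stored login has likely stopped working."""
    return paywall_marker not in html.lower()
```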
“Knowing that hundreds of thousands of users around the world depend on the content provided by Opoint Technology, we put all our efforts into maintaining high quality, stable deliveries and the best customer support in the market.”
Jon Anders Tangnes, COO at Opoint Technology