Web Scraping and Hedge Funds’ Alternative Data Strategy


Data that was once hiding in plain sight can hide no more. With sophisticated software, new data continues to be extracted – or ‘scraped’ – from a growing variety of online sources. Coupled with more in-depth analytics, hedge funds are now capitalizing on a new and increasingly essential source of alpha.

Earlier this year, Greenwich Associates and Thomson Reuters released a joint study that provides a fascinating insight into how the investment research landscape is dramatically changing.

Titled “The Future of Investment Research,” the study covers a handful of contributory factors underpinning this transformation and contains some particularly telling observations about alternative data.

Investment Research

We have previously discussed just how important alternative datasets such as satellite imagery and geolocation data are proving to hedge funds. These datasets offer a wealth of untapped alpha for those institutions willing to spend money on their procurement, enabling them to exploit a crucial informational edge over their market peers.

Indeed, the Greenwich / Thomson Reuters research shows that the average investment firm is spending about $900,000 yearly on alternative data, while its estimate of annual industry budgets for alternative data currently stands at $300 million – almost twice as much as a year ago. And of this alternative data, the most popular form used by investment professionals is web-scraped data.



Web scraping (also known as ‘automated data collection’, ‘data scraping’ and ‘spidering’) refers to the process of using software to pull in potentially valuable data from websites. For hedge funds, paying companies to obtain such data can help them make smarter, better informed investment decisions before the rest of the market catches on.
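As a minimal illustration of what that software does, the sketch below parses product prices out of a page's HTML using only Python's standard library. The markup and class names are invented for the example; a real scraper would fetch live pages over HTTP (ideally respecting each site's robots.txt) and handle far messier markup.

```python
from html.parser import HTMLParser

# Hypothetical product-page markup. A real scraper would fetch this
# with an HTTP client rather than hard-coding it.
PAGE = """
<html><body>
  <div class="product"><span class="name">Camera X</span>
      <span class="price">$299.99</span></div>
  <div class="product"><span class="name">Camera Y</span>
      <span class="price">$199.00</span></div>
</body></html>
"""

class PriceParser(HTMLParser):
    """Collects the text inside <span class="price"> tags."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price:
            # Drop the currency symbol and parse the number.
            self.prices.append(float(data.strip().lstrip("$")))
            self.in_price = False

parser = PriceParser()
parser.feed(PAGE)
print(parser.prices)  # -> [299.99, 199.0]
```

Scaled up across thousands of pages and scraped on a schedule, this kind of extraction is what turns public websites into the time-series datasets funds pay for.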

One such company is Quandl, which is now decidedly positioned at the front and center of the alternative data revolution. The Canadian company often compiles datasets by scraping the web itself, or by joining forces with a domain expert, and then selling the data on to hedge funds and other interested customers.

The web-scraped data itself takes many forms, including “product pricing, search trends, insights from expert networks, and web traffic data,” according to the Greenwich report.

For example, by scraping web traffic data from Alexa.com, Goldman Sachs Asset Management was able to identify a sharp rise in visits to the HomeDepot.com website. This allowed the asset manager to buy Home Depot stock well before the company raised its outlook and the stock appreciated.

Among its numerous strategies, meanwhile, the renowned alternative data company for asset managers Eagle Alpha scrapes pricing data from large retailers, which has “proved valuable at providing a directional indicator for the sales of consumer products.” Scraping data from US electronics websites, for instance, allowed the firm to observe diminishing demand for GoPro products, and thus conclude, correctly, that the action camera manufacturer would miss its 2015 Q3 targets. This was despite the fact that “more than 68% of stock recommendations were to buy” a mere two days prior to GoPro’s eventual announcement that it had underperformed.
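A directional indicator of the kind described can be sketched very simply: track how deeply the same products are discounted across successive scrapes, and read widening discounts as softening demand. The SKU prices and list price below are invented for illustration, not Eagle Alpha's actual data or methodology.

```python
# Hypothetical scraped prices (USD) for the same three action-camera
# SKUs across three weekly scrapes of retailer listings.
LIST_PRICE = 399.0
weekly_scrapes = [
    [399.0, 389.0, 395.0],   # week 1
    [379.0, 369.0, 375.0],   # week 2
    [349.0, 339.0, 355.0],   # week 3
]

def avg_discount(prices, list_price):
    """Average fractional discount to list across observed listings."""
    return sum(1 - p / list_price for p in prices) / len(prices)

discounts = [avg_discount(week, LIST_PRICE) for week in weekly_scrapes]

# Direction: discounts widening week over week reads as a bearish
# demand signal for the product line.
signal = "bearish" if discounts[-1] > discounts[0] else "bullish"
print([round(d, 3) for d in discounts], signal)
```

Real pipelines would control for promotions, stock-outs, and product-mix changes, but the underlying logic – observed prices falling away from list as a proxy for demand – is the same.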

And then there’s social media, which is proving highly effective in ascertaining what is currently ‘trending’, and what is not. Twitter, in particular, has become an extremely popular source of meaningful information, with investors analyzing millions of tweets and retweets to trade stocks in a timely fashion.

As Bloomberg recently declared, "Access to the Twitter stream offers one of the largest and most nutritious alternative data sets for alpha-seeking researchers," having just incorporated into its news service a Twitter stream that scans for newsworthy tweets to publish. And there is even a well-respected research paper that found that "collective mood states derived from large-scale Twitter feeds" could predict Dow Jones movements with an astonishing 87.6% accuracy.
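A toy version of such a "collective mood" score can be built by netting positive against negative words across a batch of tweets. The word lists and tweets below are invented, and production systems use trained sentiment models over millions of tweets rather than keyword counts; this sketch only shows the shape of the aggregation.

```python
# Tiny hand-picked sentiment lexicons (illustrative only).
POSITIVE = {"calm", "happy", "bullish", "great", "up"}
NEGATIVE = {"fear", "worried", "bearish", "crash", "down"}

def mood_score(tweets):
    """Net (positive - negative) word count, normalized per tweet."""
    net = 0
    for tweet in tweets:
        # Lowercase and strip trailing punctuation before matching.
        words = [w.strip(".,!?") for w in tweet.lower().split()]
        net += sum(w in POSITIVE for w in words)
        net -= sum(w in NEGATIVE for w in words)
    return net / len(tweets)

tweets = [
    "Feeling calm and happy about markets today",
    "Worried about a crash, staying down on risk",
    "Great earnings, totally bullish",
]
print(mood_score(tweets))
```

Averaged over a large enough stream, a score like this becomes a daily mood time series that can be tested against subsequent market moves, which is essentially what the cited research did with far more sophisticated mood measures.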

A November 2017 survey by EY found over a quarter of hedge funds were using or planning to use social media data in their investment strategies over the ensuing 6-12 months. And obtaining data from the likes of Twitter, Facebook and YouTube is usually done directly through the providers themselves, or via third-party platforms.

But having scraped the popular, more easily accessible websites such as Twitter and Amazon, hedge funds must constantly find new and unique data sources to unearth even more accurate trading signals and stay ahead of the competition. In this regard, it would appear there is no end to how deeply firms can – and do – delve.  It may even include the dark web.


It is also likely to involve data on individuals/customers that can be scraped from sources as diverse as electoral registries, phone directories, criminal records and flight logs. Given the controversies surrounding personal data issues that have flared up this year, particularly in the wake of Facebook’s Cambridge Analytica scandal, it would seem that scrapers are on a collision course with advocates for stricter data privacy laws.

Quandl Founder and CEO Tammer Kamel has previously acknowledged that a "healthy paranoia" exists among companies to ensure personal information is removed before his company sells its alternative datasets, and that one misstep could have dire consequences for a fund operating in this space. But sufficient regulatory protection for individuals remains elusive at this stage. The data being procured by hedge funds isn't necessarily anonymized, which means a great deal of information about you can be compiled without adequate oversight yet being in place.

As the Hedge Fund Law Report acknowledged last year, “Despite the relative maturity of e-commerce, the legality of automated data collection is still unsettled. While there have been many cases that have examined scraping disputes under various state and federal statutes, the law is not uniform, and past decisions have been fact-specific in nature.” In fact, some complex legal cases have gone in favor of the scrapers…

In the US, the federal Computer Fraud and Abuse Act (CFAA) is a statute that imposes liability on anyone who "intentionally accesses a computer without authorization, or exceeds authorized access, and thereby obtains … information from any protected computer." As such, it is cited by companies trying to prevent third parties from harvesting data. In the 2017 case of hiQ Labs (a workforce analytics company) v. LinkedIn, LinkedIn invoked the CFAA to argue that hiQ violated its terms of use by using bots to scrape data from public user profiles. Ultimately, however, a court ordered LinkedIn to remove the technical measures blocking hiQ Labs from scraping, on the grounds that authorization is not required to access publicly available profile pages.

It should also be observed that web scraping is not always employed by honest actors. Cybercriminals can use it to steal copyrighted content, damaging a company's reputation in the process. And because web-scraping bots reveal nothing about their operators' intent, it can be all but impossible to distinguish malicious bots from legitimate ones.

And as web-scraping bots grow more sophisticated – evading detection with proxy IPs, for example, and working their way into APIs and web applications – malicious attacks become ever more likely to succeed.

Anatomy of an attack (source)

But these growing concerns are unlikely to deter hedge funds from employing web scraping, especially if it continues proving vital to unlocking new and lucrative investment opportunities, and regulation of the space remains a work in progress. Indeed, one estimate suggests that web-scraping bots account for as much as 46% of Internet traffic. Even on a surface level, scraping the web for mentions of a particular company can provide hedge funds with a much clearer picture of its outlook and customer perception.

As more evidence points to just how integral web scraping has become to the hedge fund industry, ethical or not, it seems that the online world is now set to be analyzed more often and more closely than ever before.

Dr Justin Chan Dr Chan founded DataDrivenInvestor.com (DDI) and is the CEO of JCube Capital Partners. Specializing in strategy development, alternative data analytics and behavioral finance, Dr Chan also has extensive experience in the investment management and financial services industries. Prior to forming JCube and DDI, Dr Chan led strategy development at multiple hedge funds and fintech companies, and served as a senior quantitative strategist at GMO. A published author in professional finance journals, Dr Chan holds a Ph.D. in finance from UCLA.
