
Kaggle for Data-Driven Investors

Dr Justin Chan

January 14, 2019

Did you know that data science can help whale conservation? With several species in sharp decline, and some, like the vaquita porpoise, nearing extinction, imaging technology is being used to track whales across the world’s oceans. Scientists use photos to pick out whales’ unique identifiers, such as the shape of their tails and the markings on their bodies. Kaggle recently launched a competition that challenges data scientists to build the best algorithm for identifying whales from Happywhale’s database of 25,000 images. The competition is just one of many that Kaggle runs so that data science practitioners can test their skills, help advance the field, and potentially win a lucrative cash prize for their efforts.

What is Kaggle?

“A platform for the world’s largest community of data scientists and machine learning engineers” is how Kaggle describes itself. Owned by Google, Kaggle aims to be the world’s premier collaborative environment for data scientists: a one-stop shop for learning and building expertise, both by engaging with a vibrant community of data scientists and ML engineers and by studying and practicing on one’s own.

That makes Kaggle a highly useful tool for data-driven investors. With over 13,000 datasets at present, it offers a veritable gold mine of data to work with. The datasets cover a diverse range of subjects; a quick glance at recent publications reveals extensive datasets on art, climate, social issues, and economics. For data-driven investing, the datasets covering finance and investing will come in very handy. Historic crypto prices, house prices, tax statistics, and macroeconomic figures are just some of what is on offer in this category.

Moreover, with its data universe growing all the time, Kaggle is likely to provide useful data for making more informed investing decisions, if not now then in the near future. And if you don’t see the data you are looking for, you can publish your own datasets to Kaggle, both to work on and to contribute to the community.
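To make the idea of working with a price dataset concrete, here is a minimal sketch in Python that turns a column of closing prices into simple returns and a volatility estimate. The column names (date, close) and the four prices are made up for illustration; a real Kaggle dataset would be read from a file and will have its own layout.

```python
import csv
import io
import statistics

# Made-up illustrative prices, NOT real market data. The column names
# "date" and "close" are assumptions; real datasets will differ.
SAMPLE_CSV = """date,close
2019-01-07,100.0
2019-01-08,102.0
2019-01-09,99.96
2019-01-10,101.9592
"""


def load_closes(csv_text):
    """Parse CSV text and return the closing prices in file order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [float(row["close"]) for row in reader]


def simple_returns(prices):
    """Return period-over-period simple returns: p[t] / p[t-1] - 1."""
    return [curr / prev - 1.0 for prev, curr in zip(prices, prices[1:])]


closes = load_closes(SAMPLE_CSV)
returns = simple_returns(closes)
# Sample standard deviation of the returns, a basic volatility measure.
volatility = statistics.stdev(returns)

print("returns:", [round(r, 4) for r in returns])
print("volatility:", round(volatility, 4))
```

To use a downloaded file instead, pass its contents to load_closes after adjusting the column name to match the dataset.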

But what can one do with this data?

A lot. Especially if you are fluent in Python or R, Kaggle offers a dynamic environment for working with data and building a data science portfolio. You can start a data science project of your own, or you can explore projects created by other members of the community. Arguably Kaggle’s most prized feature is Kernels, a cloud environment that supports collaborative coding and analysis. Essentially, a Kernel holds the code that makes your model reproducible and lets collaborators take part. There are three types of Kernel available:

Scripts – files that execute code sequentially. You can write scripts in either R or Python.
RMarkdown Scripts – scripts written in RMarkdown, a combination of R code and Markdown syntax, rather than in plain R.
Jupyter Notebooks – documents from the popular open-source Jupyter project, used by data scientists to create and share code and made up of a sequence of cells. Each cell holds either Markdown text or code in the programming language you have chosen. Conveniently, Kernels let you run Notebooks from your browser.

By choosing from the many thousands of publicly available datasets on Kaggle, or by uploading your own data, you can create your own Kernel with just a few clicks. From there you can run a range of analyses in the Kernel, such as visualizing the data. Kernels also give you access to a huge library of open-source, reproducible code, so you can follow the latest work from the data science and ML community. Both the Kaggle homepage and the Kernels listing page show the Kernels that have been written on the site, with the “hottest” ones, meaning those with high user engagement or consistent popularity, listed first. And since Kernels are cloud-based, you can work from the browser on your laptop with the reassuring backing of powerful hardware: a CPU Kernel gets 4 CPU cores, 17 GB of RAM, and 6 hours of execution time, while a GPU Kernel gets 2 CPU cores and 14 GB of RAM.
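Once a dataset is attached to a Kernel, a common first step is simply to see which files are available. The sketch below assumes that attached data appears as ordinary files under an input directory; the path "../input" is an assumption rather than something this article confirms, so check the data panel of your own Kernel for the actual location.

```python
import os


def list_input_files(input_dir):
    """Return the sorted relative paths of all files under input_dir."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(input_dir):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            found.append(os.path.relpath(full_path, input_dir))
    return sorted(found)


if __name__ == "__main__":
    # "../input" is an assumed location, not a documented guarantee.
    for path in list_input_files("../input"):
        print(path)
```

If the directory does not exist, the function returns an empty list rather than raising an error, which keeps the same script usable outside Kaggle.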


Kaggle also offers a range of data science competitions for those with sufficient skills. Indeed, with potentially tens of thousands of dollars on offer as prize money, such contests give data science experts an incentive not only to participate but also to stick around as active members of the Kaggle community. As a data-driven investor, why not enter the House Price competition as practice to boost your own knowledge and skill set? Or, if you’re feeling more confident, try the Two Sigma competition “Using News to Predict Stock Movements” for a chance to win a whopping $100,000.

Ultimately, it is the community aspect of Kaggle that sets it apart as fertile ground for the evolution of data science. Being able to connect with a vast and growing community of other data scientists and ML engineers to solve global problems is priceless. Beyond the chance to seek advice from such experts, being involved in this community gives you many opportunities to grow your own professional network. You can of course reciprocate and pass on your own skills to those who wish to learn from you. And by contributing solutions to data science problems, winning competitions, or simply being an active member of the community, you can boost your own stock and visibility on the site.

Lastly, for the data-driven investors out there, the ability to analyze data comprehensively, use a variety of analytical tools, and collaborate with many others around the world could mean tremendous progress toward making more accurate and informed investment decisions.
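For readers tempted by the House Price practice competition mentioned above, a first baseline can be very small. The sketch below fits a one-feature linear regression and scores it with a log-scale error, which is one common way to judge price predictions. The feature choice (living area), the data values, and the metric are all assumptions for illustration; check the competition’s own data description and evaluation page before relying on any of them.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def rmse_log(predicted, actual):
    """Root mean squared error between log(predicted) and log(actual)."""
    squared = [(math.log(p) - math.log(a)) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up illustrative data, NOT taken from any Kaggle dataset:
# living area in square feet and sale price in US dollars.
areas = [850.0, 1200.0, 1500.0, 2100.0, 2600.0]
prices = [110000.0, 150000.0, 178000.0, 240000.0, 295000.0]

slope, intercept = fit_line(areas, prices)
predictions = [slope * a + intercept for a in areas]

print("slope:", round(slope, 2), "intercept:", round(intercept, 2))
print("in-sample log RMSE:", round(rmse_log(predictions, prices), 4))
```

The score here is computed on the same rows used for fitting, so it flatters the model; a real entry would hold out part of the data for validation and use many more features.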

Dr Justin Chan

Dr Chan founded DataDrivenInvestor.com (DDI) and is the CEO of JCube Capital Partners. Specializing in strategy development, alternative data analytics, and behavioral finance, Dr Chan also has extensive experience in the investment management and financial services industries. Prior to forming JCube and DDI, Dr Chan worked in strategy development at multiple hedge funds and fintech companies, and served as a senior quantitative strategist at GMO. A published author in professional finance journals, Dr Chan holds a Ph.D. in finance from UCLA.


