As a newcomer to the field of data science, it is pretty incredible to see the power of a simple algorithm in understanding the world around us. Classification, prediction, optimization, generation: all of these are within the realm of a few lines of code. At first glance it seems that throwing the right kind of statistical methods and algorithms at a problem will be sufficient to get the desired solution. At least, that's the naive understanding one can find oneself operating with initially.
I recently attempted to throw my proverbial hat into the ring of the PLAsTiCC astronomy classification competition hosted by Kaggle about a year ago. Given a set of (projected) telescope readings of various light sources over time, could you classify said sources? While the task clearly isn't as straightforward as it may seem (what competition pays out for easy solutions?), it does appear somewhat approachable. Or so I thought... dramatic pause
Looking over the data and a few discussions by the original competitors, it became apparent that the dataset centers on ~1.5 million objects represented by multivariate time series. My own understanding of time series is fairly limited, barely extending beyond the simple data manipulation needed to apply basic ARIMA models. But surely the problem could be broken down into something simpler, something manageable that can be solved with fundamental approaches. Only one way to find out, right? Let's look at the data.
An explanation of each of the variables can be found here, but just reading up on all of them to gain even an elementary grasp of the context of the problem took me quite some time. The high-level explanation of the data is that a telescope to be built in the future is expected to make certain observations across certain patches of the sky at various times, recording the light detected in different channels (think RGB) and how much that light fluctuates.
So if you're anything like me, just trying to grasp all of those variables at a conceptual level is dizzying. As a non-astronomer, it took quite a lot of reading to reach even a rudimentary understanding of the implications of many of the factors presented. What the competition states is that all of these variables combine to give you what is called a light curve, which is essentially our time series. If you monitor the light given off by an object over time, you can develop a profile that helps identify the nature of said object.
Spikes in the curve indicate growing brightness, while dips indicate dimming. So all we should need to do is plot the observations and group the profiles, right? Right, so let's do that.
Well, that sure as Schrödinger doesn't look like a curve. How are you supposed to extrapolate any kind of time series from that? It turns out the telescope rotates along with the Earth (who'da thunk it?) and only has certain windows in which it can observe a given object. So what we have are unevenly sampled time series with large gaps in the data. Now, more experienced practitioners, or just smarter people than myself, may have a natural intuition as to how this problem can be circumvented. Alas, I am not one of those people. Personally, I would be stuck at this point. Luckily, we live in a society...
It turns out that you can take data like the above and transform it to identify the phase of the light emissions. This discards the time component of the data, making the uneven sampling a nonissue in this frame of reference. Without going into how the transform is achieved, here is the result:
This is a much more typical picture of what we imagine a curve to be, and it even behaves in such a way that we can easily use it as a profile to help classify the object in question. Fantastic! But so what? What was the big deal about that?
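For the curious, one common way to get from the gappy scatter to a phase profile is a Lomb-Scargle periodogram (which handles uneven sampling natively) followed by phase folding. This is just a sketch on synthetic data I made up for illustration, not the actual pipeline used in the competition:

```python
import numpy as np
from scipy.signal import lombscargle

# Synthetic unevenly sampled light curve: a periodic source
# (true period 2.5 days) observed at irregular times, with noise.
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.0, 100.0, 150))   # irregular observation times (days)
true_period = 2.5
flux = np.sin(2 * np.pi * t / true_period) + 0.1 * rng.normal(size=t.size)

# Lomb-Scargle works directly on uneven sampling: scan candidate
# periods and keep the one with the most periodogram power.
periods = np.linspace(0.5, 10.0, 5000)
ang_freqs = 2 * np.pi / periods             # lombscargle expects angular frequencies
power = lombscargle(t, flux, ang_freqs, normalize=True)
best_period = periods[np.argmax(power)]

# Phase folding: replace time with phase = (t mod P) / P, which collapses
# the observing gaps and stacks every cycle on top of the others.
phase = (t % best_period) / best_period
```

Plotting `flux` against `phase` instead of against `t` is what turns the mess of disconnected points into a clean, reusable profile.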
It took me quite a bit of reading and hours of research to understand the data and context well enough to even comprehend how the above transform works, or why it is the logical approach to light data. As it so happens, astronomers featured heavily among the top scorers in this competition, and the top solution was produced by an astronomy grad student. Beyond how to even handle the uneven sampling of the data, there are additional issues that only domain knowledge would illuminate. Objects in other galaxies won't be subject to light extinction the way those in our own Milky Way are; redshift will be significant for extra-galactic objects while relatively insignificant for intra-galactic ones; the flux of light is influenced by redshift and needs to be corrected for it. All of these are integral features of the best performing models in this competition, and all of them are beyond the scope of anyone without the appropriate domain knowledge. Can you figure these things out by brute force? Probably. But how much time will it take? Would you be able to explain the why and how to someone else when they scrutinize your approach?
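To make the redshift point a little more concrete: comparing the brightness of objects at different redshifts requires the luminosity distance, which for a flat universe comes from an integral over redshift. The sketch below assumes a flat ΛCDM cosmology with H0 = 70 and Ωm = 0.3 (my own illustrative choices; the helper names are hypothetical, not from the competition code):

```python
import numpy as np

C_KM_S = 299792.458   # speed of light (km/s)
H0 = 70.0             # assumed Hubble constant (km/s/Mpc)
OMEGA_M = 0.3         # assumed matter density; flat universe => Omega_Lambda = 0.7

def luminosity_distance(z, steps=10_000):
    """Luminosity distance in Mpc for an assumed flat LambdaCDM cosmology:
    d_L = (1 + z) * (c / H0) * integral_0^z dz' / E(z'),
    with E(z) = sqrt(Omega_m * (1 + z)^3 + Omega_Lambda)."""
    zs = np.linspace(0.0, z, steps)
    integrand = 1.0 / np.sqrt(OMEGA_M * (1 + zs) ** 3 + (1 - OMEGA_M))
    dz = zs[1] - zs[0]
    # Trapezoid rule for the comoving-distance integral.
    comoving = (C_KM_S / H0) * dz * (integrand.sum() - 0.5 * (integrand[0] + integrand[-1]))
    return (1 + z) * comoving

def distance_modulus(z):
    """m - M: how many magnitudes fainter an object appears at redshift z
    than it would at the reference distance of 10 parsecs."""
    d_pc = luminosity_distance(z) * 1e6   # Mpc -> pc
    return 5.0 * np.log10(d_pc / 10.0)
```

The upshot is that two sources with identical observed flux but different redshifts can differ enormously in intrinsic luminosity, which is exactly the kind of correction the top models baked in as features.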
While it is not impossible to adequately tackle most problems without domain knowledge, it requires vastly more time and energy to reach the same or even similar results as someone in possession of the appropriate knowledge. Without looking at the work of others and doing extensive reading, I would have no idea how to effectively move forward with this project, or even what considerations would need to be taken into account to sensibly approach the problem. I definitely bit off more than I could chew with this competition, and it seems to be a common challenge with many real-world physics problems.
Understanding your problem is vital to properly solving it, and the more knowledge you have the better equipped you will be. Lean on those with knowledge you need and know your own limitations.