<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Brandon Lau</title>
    <description>The latest articles on DEV Community by Brandon Lau (@blau).</description>
    <link>https://dev.to/blau</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F224625%2F07b428b0-084c-4204-b836-64665a2bae7d.png</url>
      <title>DEV Community: Brandon Lau</title>
      <link>https://dev.to/blau</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/blau"/>
    <language>en</language>
    <item>
      <title>Timely Considerations</title>
      <dc:creator>Brandon Lau</dc:creator>
      <pubDate>Fri, 15 Nov 2019 20:39:12 +0000</pubDate>
      <link>https://dev.to/blau/timely-considerations-47m1</link>
      <guid>https://dev.to/blau/timely-considerations-47m1</guid>
      <description>&lt;p&gt;Forecasting is one of the more complicated challenges to tackle when it comes to data, and before me lays such a challenge, sprawled out on the python embroidered divan of my notebook like one of Leo's proverbial French girls. &lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F473g27mbab6osi5pcifa.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F473g27mbab6osi5pcifa.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
But unlike our maritime Romeo's sketch, time series forecasting isn't as simple as putting charcoal to paper to describe the dimly lit subject before you. There are a number of difficulties that sequentially dependent data presents that simpler datasets aren't concerned with.&lt;/p&gt;

&lt;p&gt;Let's take a look at an example of such a series in the ever-intriguing Bitcoin data.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fa3q48hz0uwer46bl5diz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fa3q48hz0uwer46bl5diz.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This sort of graph isn't unfamiliar to anyone who's ever been in a math class, and it turns out that much of the data in the world has properties that are time dependent. The graph shows a significant amount of volatility, trends in both the positive and negative, and no clear periodicity. How would one tackle a forecasting problem using data such as this?&lt;/p&gt;

&lt;p&gt;Because of the auto-correlation between current and past observations, several features become very important when approaching a time series:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Level&lt;/strong&gt; - What are the baseline values of my series if it were "flattened"?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trend&lt;/strong&gt; - Is there an overall positive/negative drift to my data over time?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Seasonality&lt;/strong&gt; - Are there cyclical patterns in my data, either regular or irregular? &lt;/li&gt;
&lt;/ul&gt;
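&lt;p&gt;As a minimal sketch (on made-up monthly data, not the Bitcoin series), you could pull these three components apart with pandas:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series: level + trend + seasonality + noise
rng = np.random.default_rng(0)
idx = pd.date_range("2015-01-01", periods=48, freq="MS")
t = np.arange(48)
y = pd.Series(100 + 0.5 * t + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 1, 48), index=idx)

trend = y.rolling(window=12, center=True).mean()   # a 12-month window smooths out the season
detrended = y - trend
seasonal = detrended.groupby(detrended.index.month).transform("mean")  # average pattern per month
residual = y - trend - seasonal                    # what's left after level/trend/season
```

The rolling mean here captures level plus trend, the per-month averages capture seasonality, and whatever remains is noise.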

&lt;h1&gt;Possible Approaches&lt;/h1&gt;

&lt;h3&gt;Model-less&lt;/h3&gt;

&lt;p&gt;The most naive take is a model-less approach: simply take the mean/median (depending on the structure and volatility of your data) at certain times in the past to make predictions for similar times in the future. While simplistic, this approach often provides reasonable results and can actually be difficult to surpass with more complicated methods.&lt;/p&gt;
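&lt;p&gt;A minimal sketch of such a model-less baseline, on hypothetical daily data with a weekly pattern: predict each future day with the median of past observations from the same weekday.&lt;/p&gt;

```python
import pandas as pd

# Four weeks of invented daily data with a weekend spike, starting on a Monday
history = pd.Series(
    [10, 12, 9, 11, 30, 35, 8] * 4,
    index=pd.date_range("2019-09-02", periods=28, freq="D"),
)
weekday_median = history.groupby(history.index.dayofweek).median()  # one value per weekday

# "Predict" the following week by looking up each day's weekday median
future = pd.date_range("2019-09-30", periods=7, freq="D")
forecast = pd.Series(weekday_median.loc[future.dayofweek].to_numpy(), index=future)
```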

&lt;h3&gt;ARIMA and its offspring&lt;/h3&gt;

&lt;p&gt;One of the simplest models for time series modeling is the ARIMA model. ARIMA stands for autoregressive integrated moving average, and it is fairly adept at handling simple time series. However, ARIMA makes several assumptions that may not be true of real world data, namely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The data is stationary&lt;/li&gt;
&lt;li&gt;The data is univariate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It also helps if your data is relatively clean of missing values and extreme outliers. Obviously most real world data violates these assumptions to one degree or another, though there are ways to manipulate your data to accommodate said violations. For stationarity you can difference the series, subtracting previous terms from current terms to remove seasonality and trend. For volatile data like the Bitcoin series shown above you may be able to break the dataset into smaller periods, apply ARIMA over the sub-periods, and combine the results later for a better overall prediction. There are variations on the basic ARIMA structure, such as SARIMA (seasonal ARIMA), that attempt to account for data that does not naturally follow the assumptions of the base model.&lt;/p&gt;
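&lt;p&gt;A rough sketch of the differencing idea, on a synthetic random walk rather than real Bitcoin data, with a hand-rolled AR(1) fit standing in for a full ARIMA(1,1,0):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Random walk with drift: non-stationary, but one difference recovers the
# (roughly stationary) step sizes. That's ARIMA's "I" (integrated) step.
rng = np.random.default_rng(1)
y = pd.Series(np.cumsum(rng.normal(0.5, 1.0, 200)))

diff = y.diff().dropna()   # first difference: y[t] - y[t-1]

# Minimal AR(1) fit on the differenced series via least squares
# (a sketch of the idea, not statsmodels)
x_lag, x_now = diff.to_numpy()[:-1], diff.to_numpy()[1:]
phi, intercept = np.polyfit(x_lag, x_now, 1)
one_step_ahead = intercept + phi * diff.iloc[-1]
```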

&lt;h3&gt;Nets&lt;/h3&gt;

&lt;p&gt;As the popularity of neural nets continues to grow it is only natural that they would be applied to time series. As it turns out there are several fields where NNs are the best approach, or provide significant boosts over classical approaches. Neural nets have the potential to interpret structures and meaning in our data that may be difficult to extract using classical methods.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Language&lt;/strong&gt; - Sentences and the meaning of the words within them are inherently dependent on sequence, and neural nets have proven a powerful tool in capturing that information in tasks such as translation and word generation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audio&lt;/strong&gt; - Whether it be music, sounds, or spoken word, audio data is also sequentially dependent. The notes that precede each other determine, or at least heavily influence, the notes that come after.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visual&lt;/strong&gt; - In the field of computer vision, images are processed as sequences of encoded pixels, transforming something that may not intuitively be a sequence into a time series. The order of the pixels determines the overall "meaning" of an image, and to interpret them out of order would ignore much of the inherent information in a picture. Neural nets are integral to this field of study, as classical methods were unable to make much headway.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While a well-constructed neural net may be able to provide results in any given application, neural nets are expensive in both computation and time, and may be wholly unnecessary if they provide only slight improvements over much simpler methods.&lt;/p&gt;

&lt;h3&gt;Considerations&lt;/h3&gt;

&lt;p&gt;As with any modeling approach there are less obvious factors that one must take into account when dealing with time series.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How much data do I have to work with?&lt;/strong&gt;&lt;br&gt;
Depending on the density/quantity of your data and how far into the future you are attempting to predict, this question can be vital. Generally, the more data you have at your disposal and the smaller the horizon of your prediction, the less of an issue this will be. But in the real world you are often forced to work with limited data, and may be required to make predictions far enough out that you may feel it better to just pray to RNG-sus for the answer. This can also make the traditional train-test split used to evaluate the efficacy of a model a significant challenge.&lt;/p&gt;
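&lt;p&gt;One common workaround for the splitting problem is an expanding-window (walk-forward) evaluation: always train on the past and test on the future, never shuffling. A minimal sketch on a stand-in series:&lt;/p&gt;

```python
import numpy as np

y = np.arange(100.0)  # stand-in for an ordered time series

horizon = 5
for cut in (70, 80, 90):
    # Train on everything before the cut, test on the next `horizon` points
    train, test = y[:cut], y[cut:cut + horizon]
    prediction = np.full(horizon, train[-1])   # naive "last value" forecast
    mae = np.mean(np.abs(test - prediction))   # score this fold, then expand the window
```

Each pass grows the training window and scores the model on data it has never seen, mimicking how the model would actually be used.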

&lt;p&gt;&lt;strong&gt;What's the uncertainty of my model?&lt;/strong&gt;&lt;br&gt;
Sometimes the accuracy of a prediction is less important than the confidence you have in that prediction. Spot on predictions are nearly impossible in the real world given the nigh infinite influences that may be present within or outside of your data, and so the uncertainty of your predictions because an important detail. In the financial world the uncertainty is sometimes even the focus of your modeling as opposed to a point forecast.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How often do I need to retrain my model?&lt;/strong&gt;&lt;br&gt;
Many time-series problems involve data that is highly variable and shows significant differences in behavior over time. Image data is one instance where this is less of a factor, as most objects maintain their overall appearance over time (cats twenty years ago look much like cats today, and probably cats 20 years in the future). However financial data does not typically have this feature. This requires that effective models be retrained on new data on a regular basis. In order for the model to correct its outputs to reflect changes in behavior or new external pressures it will need to see up to date info that wasn't present in your original training. Otherwise you end up with predictions that eventually shift to a constant, like this:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ftburzdtdf13k03nwp5x8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ftburzdtdf13k03nwp5x8.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;
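&lt;p&gt;The idea can be sketched with a trivial "mean of the most recent window" model refit as each new observation arrives, on synthetic data with a deliberate regime shift:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(3)
# Invented stream whose behavior changes halfway through
stream = np.concatenate([rng.normal(10, 1, 200), rng.normal(25, 1, 200)])

window = 50
predictions = []
for t in range(window, len(stream)):
    # "Retrain" on the most recent window only, so the model tracks the shift
    model = stream[t - window:t].mean()
    predictions.append(model)
```

A model frozen on the first regime would keep predicting values near 10 forever; the retrained one drifts up to the new level.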

&lt;p&gt;All that to be said, time series are a relatively complicated challenge to tackle and require a different set of approaches that non-time-series datasets may not.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>timeseries</category>
    </item>
    <item>
      <title>The Borders of Your Domain (Knowledge)</title>
      <dc:creator>Brandon Lau</dc:creator>
      <pubDate>Tue, 05 Nov 2019 20:10:43 +0000</pubDate>
      <link>https://dev.to/blau/the-borders-of-your-domain-knowledge-3njc</link>
      <guid>https://dev.to/blau/the-borders-of-your-domain-knowledge-3njc</guid>
      <description>&lt;p&gt;As a newcomer to the field of data science it is pretty incredible to see the power of a simple algorithm in understanding the world around us. Classification, prediction, optimization, generation; all these things are within the realm of a few lines of code. At first glance it seems that throwing the right kind of statistical methods and algorithms at a problem will be sufficient to get the desired solution. At least that's the naive understanding one can find oneself operating with initially.&lt;/p&gt;

&lt;p&gt;I recently attempted to throw my proverbial hat into the ring of the PLAsTiCC astronomy classification competition hosted by Kaggle about a year ago. Given a set of (projected) telescope readings of various light sources over time, could you classify said sources? While the task clearly isn't as straightforward as it may seem (what competition pays out for easy solutions?) it does appear somewhat approachable. Or so I thought... &lt;em&gt;dramatic pause&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Looking over the data and a few discussions by the original competitors it became apparent that the data centered around ~1.5 million objects represented by multivariate time series. My own understanding of time series is fairly limited, barely extending beyond simple data manipulation to allow for the application of basic ARIMA models. But surely the problem could be broken down to something simpler, something manageable that can be solved with fundamental approaches. Only one way to find out, right? Let's look at the data.&lt;/p&gt;

&lt;p&gt;Object Data&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wmZBYurO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/zk8ps54948vnv006ue8d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wmZBYurO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/zk8ps54948vnv006ue8d.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Meta Data&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CGRpiGua--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/p62p7z1ul06ati6abhqu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CGRpiGua--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/p62p7z1ul06ati6abhqu.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An explanation of each of the variables can be found here, but just reading up on all of them to gain even an elementary grasp of the context of the problem took me quite some time. The high-level explanation of what is presented in the data is that a telescope to be built in the future is expected to make certain observations across certain patches of the sky at various times, recording the light detected in different channels (think RGB) and how much that light fluctuates.&lt;/p&gt;

&lt;p&gt;If you're anything like me, just trying to grasp all of those variables at a conceptual level is dizzying. As a non-astronomer it took quite a lot of reading to get up to even a rudimentary understanding of the implications of many of the factors presented. What is stated in the competition is that all of these variables combine to give you what is called a light curve, which is essentially our time series. If you monitor the light given off by an object over time you can develop a profile that can help identify the nature of said object.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/kp2MUcoSGYU4E/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/kp2MUcoSGYU4E/giphy.gif" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Spikes in the curve would indicate growing brightness, while dips indicate a dimming. So all we should need to do is plot the observations and group the profiles, right? Right, so let's do that.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xHViivxO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/lpieqaf33ecuxuvfw5sa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xHViivxO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/lpieqaf33ecuxuvfw5sa.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/jWexOOlYe241y/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/jWexOOlYe241y/giphy.gif" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Well that sure as Schrodinger doesn't look like a curve. How are you supposed to extrapolate any kind of time series from that? Turns out the telescope rotates along with the earth (who'da thunk it?) and only has certain windows in which it can observe a given object. So what we have are unevenly sampled time series with large gaps in the data. Now, more experienced practitioners, or just smarter people than myself, may have a natural intuition as to how this problem can be circumvented. Alas, I am not one of those people. Personally I would be stuck at this point. Luckily we live in a society...&lt;/p&gt;

&lt;p&gt;It turns out that you can take data like that above and transform it to identify the phase of the light emissions. This ignores the time component of the data, making the uneven sampling a nonissue in this frame of reference. Without going into how the transform is achieved, here is the result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oCVo9MYu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/99uun4zowm79vk6n4ufl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oCVo9MYu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/99uun4zowm79vk6n4ufl.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is a much more typical image of what we imagine a curve to be, and it even behaves in such a way that we can easily use it as a profile for the object in question to help classify it. Fantastic! But so what? What was the big deal about that?&lt;/p&gt;
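&lt;p&gt;For the curious, the folding idea can be sketched in a few lines, assuming the object's period has already been found (e.g. with a Lomb-Scargle periodogram; the numbers below are invented):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(4)
times = np.sort(rng.uniform(0, 100, 120))   # uneven sampling with gaps
period = 3.7                                 # assumed-known period
flux = np.sin(2 * np.pi * times / period)    # hypothetical periodic light curve

# Fold each observation time onto a 0-1 phase: the gaps in time vanish,
# and all cycles stack on top of one another
phase = (times % period) / period
order = np.argsort(phase)
folded_phase, folded_flux = phase[order], flux[order]
```

Plotting `folded_flux` against `folded_phase` turns the scattered points into one clean cycle.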

&lt;p&gt;It took me quite a bit of reading and hours of research to understand the data and context well enough to even comprehend how the above transform works, or why it is the logical approach to light data. As it happens, astronomers featured heavily among the top scorers in this competition, and the top solution was produced by an astronomy grad student. Beyond how to even handle the uneven sampling, there are additional issues in the data that only domain knowledge would illuminate. The fact that objects in other galaxies won't be subject to light extinction like those in our own Milky Way, that redshift will be significant for extragalactic objects while relatively insignificant for intragalactic objects, that the flux of light is influenced by redshift and needs to be corrected for; all of these are integral features of the best performing models in this competition, and all of them are beyond the scope of anyone without the appropriate domain knowledge. Can you figure these things out by brute force? Probably. But how much time will it take? Would you be able to explain the why and how to someone else when they scrutinize your approach? &lt;/p&gt;

&lt;p&gt;While it is not impossible to adequately tackle most problems without domain knowledge, it requires exponentially more resources in both time and energy to come to the same or even similar results as someone in possession of the appropriate knowledge. Without looking at the work of others and extensive reading I would have no idea how to effectively move forward with this project, or even what considerations would need to be taken into account to sensibly approach the problem. I definitely bit off more than I could chew with this competition, and it seems to be a common challenge with many real world physics problems.&lt;/p&gt;

&lt;p&gt;Understanding your problem is vital to properly solving it, and the more knowledge you have the better equipped you will be. Lean on those with knowledge you need and know your own limitations. &lt;/p&gt;

</description>
    </item>
    <item>
      <title>How To Raise A Model</title>
      <dc:creator>Brandon Lau</dc:creator>
      <pubDate>Fri, 18 Oct 2019 19:48:49 +0000</pubDate>
      <link>https://dev.to/blau/how-to-raise-a-model-7mo</link>
      <guid>https://dev.to/blau/how-to-raise-a-model-7mo</guid>
      <description>&lt;p&gt;According to Urban Dictionary, the act of being "learnt" is to be turnt on knowledge, or to be under the influence of education. As fate would have it the entire goal of machine learning is to get our models to be learnt, and the controlled substance of choice is the data we feed those models. As with most things any inherent pitfalls can be mitigated through oversight and leaning on the experience of those who have already become learnt. Here we will look at the different real world approaches that line up with the nonsensical metaphorical framework I have previously constructed.&lt;/p&gt;

&lt;h1&gt;Supervised Learning&lt;/h1&gt;

&lt;p&gt;As the name implies, supervised learning is akin to having a parent or teacher holding your hand through most of the learning process. A teacher walking the class through example problems, or taking your child camping and teaching them how to start fires, pitch tents, and identify scat. You end up with very accurate understandings of the world, but are dependent on the instructor's prior knowledge. In the world of machine learning this manifests itself in labeled data. If you are lucky, the information has been correctly labeled beforehand, but many times you will work with data that is either incorrectly labeled or entirely unlabeled. At that point you have to go through your data and assign labels manually. This can be done through broad-spectrum brute-force algorithms, or by leaning on a human expert who can apply their domain knowledge to the data to derive accurate labels. Supervised learning is essentially the standard approach to building models, but is limited by the breadth of its training. It can also be expensive in terms of time and resources to implement properly, depending on the quality and nature of the data and problem you are working with.&lt;/p&gt;

&lt;h1&gt;Unsupervised Learning&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/cEuXHPlFQqZNu/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/cEuXHPlFQqZNu/giphy.gif" alt="Unsupervised"&gt;&lt;/a&gt;&lt;br&gt;
Unsupervised learning is equally intuitive, where instead of providing guidance you leave the child alone in the forest with a Capri Sun and a spork and tell him he has three days to get home. In data terms this means feeding unlabeled data into your models with the intent that they can derive some meaning or new information based on the data's overall structure. This is generally used in tasks such as dimensionality reduction (PCA), clustering, and outlier detection. Some theorize this approach may be the key to true AI in the future, since it isn't bound by the training parameters used in supervised learning. This would give models the potential to learn and adapt to novel tasks outside of their original intent, but for now the approach is limited to the aforementioned functions.&lt;/p&gt;
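&lt;p&gt;A small sketch of unsupervised structure-finding: PCA via SVD on made-up unlabeled 3-D points that secretly live near a 2-D plane:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(5)
latent = rng.normal(0, 1, (200, 2))                      # hidden 2-D structure
mixing = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])  # lift it into 3-D
X = latent @ mixing.T + rng.normal(0, 0.01, (200, 3))    # plus a little noise

# PCA by hand: center, take the SVD, keep the top components
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = (s ** 2) / (s ** 2).sum()   # variance captured per component
reduced = Xc @ Vt[:2].T                 # project onto the top 2 components
```

No labels were needed: the data's own covariance structure reveals that two dimensions carry nearly all the information.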

&lt;h1&gt;Semi-supervised Learning&lt;/h1&gt;

&lt;p&gt;This is the love child of the above approaches, mixing them in hopes of finding the sweet spot of investment vs. profit. Raising the kid the right way, then giving them the freedom to explore on their own, trusting that you've instilled the values needed to navigate the world. For data purposes this means taking a trained model and using it to evaluate a set of unlabeled data. You then find the predictions your model has the most confidence in and add those observations to your original dataset with the predicted labels (pseudo-labeling). &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8nn3jK_6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://datawhatnow.com/wp-content/uploads/2017/08/pseudo-labeling.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8nn3jK_6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://datawhatnow.com/wp-content/uploads/2017/08/pseudo-labeling.png" alt="Pseudo"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Initially you cannot be 100% sure these labels are correct, hence the pseudo, but this is also why we restrict our selections to the data our model is most confident about, while also iterating multiple times. Rinse and repeat until satisfied. Theoretically this gives your model more data to work with, utilizing the inherent structures present in your data to amplify your model's predictive power. This of course assumes you can trust the architecture of your model and that your data is relatively well behaved. Strong outliers or incorrect predictions from your original model can derail this process, and one of its disadvantages is that it has no way to self-correct.&lt;/p&gt;
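&lt;p&gt;A toy sketch of the pseudo-labeling loop, using a nearest-centroid "model" on invented 2-D clusters, with the margin between the two nearest centroids as the confidence measure:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(6)
# Ten labeled points from two well-separated clusters, plus 100 unlabeled ones
labeled = np.vstack([rng.normal(0, 0.5, (5, 2)), rng.normal(4, 0.5, (5, 2))])
labels = np.array([0] * 5 + [1] * 5)
unlabeled = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(4, 0.5, (50, 2))])

for _ in range(3):  # rinse and repeat
    centroids = np.array([labeled[labels == c].mean(axis=0) for c in (0, 1)])
    dists = np.linalg.norm(unlabeled[:, None, :] - centroids[None, :, :], axis=2)
    pred = dists.argmin(axis=1)                     # predicted class per point
    confidence = np.abs(dists[:, 0] - dists[:, 1])  # margin between the classes
    keep = confidence.argsort()[-10:]               # only the most confident
    labeled = np.vstack([labeled, unlabeled[keep]]) # fold pseudo-labels back in
    labels = np.concatenate([labels, pred[keep]])
    unlabeled = np.delete(unlabeled, keep, axis=0)
```

Each pass "retrains" the centroids on a slightly larger labeled set, exactly the rinse-and-repeat described above.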

&lt;p&gt;Ultimately you would use combinations of these approaches to properly and thoroughly analyze data and build robust models. &lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Pandas-Profiling Aiding Productivity - Python (PPAP...P?)</title>
      <dc:creator>Brandon Lau</dc:creator>
      <pubDate>Fri, 27 Sep 2019 19:02:50 +0000</pubDate>
      <link>https://dev.to/blau/pandas-profiling-aiding-productivity-python-ppap-p-5g55</link>
      <guid>https://dev.to/blau/pandas-profiling-aiding-productivity-python-ppap-p-5g55</guid>
      <description>&lt;p&gt;One of the most important axioms in data science that I've come across in my extensive and thorough journey is the idea that &lt;a href="https://twitter.com/terrencehoward/status/925754491881877507"&gt;1 x 1 = 2&lt;/a&gt;. No, no wait... &lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/c7PcKQlOqZ8Ws/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/c7PcKQlOqZ8Ws/giphy.gif" alt="maths" width="350" height="149"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Actually it's the idea that good features build good models. Without a solid understanding of the underlying structures and interactions within your data (or lack thereof) you cannot hope to create a meaningful interpretation of the information in question, and thus cannot manipulate or create meaningful features for your model. This is the driving force behind quality &lt;a href="https://en.wikipedia.org/wiki/Exploratory_data_analysis"&gt;exploratory data analysis&lt;/a&gt; (EDA).&lt;/p&gt;

&lt;p&gt;Given how integral EDA is to the process of data science and effective modeling, it is important that it be done well. The initial approach to most projects will generally start with the same methods (discerning ranges of features, looking for missing information, counting how many different values a feature might contain, etc.) and can be done almost entirely with pandas. So what's the problem? &lt;/p&gt;
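&lt;p&gt;Those same first-pass methods, done by hand in pandas on a tiny hypothetical frame, look something like this:&lt;/p&gt;

```python
import pandas as pd

# Invented data standing in for a real dataset
df = pd.DataFrame({
    "age": [23, 35, None, 41, 35],
    "city": ["NY", "SF", "NY", None, "NY"],
})

ranges = df["age"].agg(["min", "max"])    # discerning ranges of features
missing = df.isna().sum()                 # looking for missing information
cardinality = df.nunique()                # how many distinct values per feature
top_values = df["city"].value_counts()    # most frequent values
summary = df.describe(include="all")      # quantile/descriptive statistics
```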

&lt;p&gt;While pandas is fully capable, it can be cumbersome. Repeating the same functions over and over, regardless of their simplicity, becomes more tedious and time consuming as your data grows larger and more complex. Each action generally requires separate lines of code, potentially costing you seconds, even minutes of your time. Absolutely outrageous. With Pandas Profiling you can accomplish most of your rudimentary analysis with a single line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas_profiling  # importing adds .profile_report() to DataFrames

df.profile_report()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This humble command will output a lovely report that tidily answers many of the fundamental questions you would typically have concerning a given data set, and can even be output in HTML format! &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ELGWR3cm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/profile_images/617154725046259713/0Q6wAtu0_400x400.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ELGWR3cm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/profile_images/617154725046259713/0Q6wAtu0_400x400.png" alt="feels good" width="256" height="256"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Included in the report are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Essentials&lt;/strong&gt;: type, unique values, missing values
&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dMddN7nW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/3az1idwb22nktfyer4vy.png" alt="overview" width="880" height="278"&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quantile statistics&lt;/strong&gt;: minimum value, Q1, median, Q3, maximum, range, interquartile range&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Descriptive statistics&lt;/strong&gt;: mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Most frequent values&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Histogram&lt;/strong&gt;
&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--l5zzL8QW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/we5bz9osqggq72v1t6nh.png" alt="variable" width="880" height="293"&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Correlations&lt;/strong&gt;: Spearman and Pearson matrices
&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9LfwYbZ9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/ggsjmi5gklzqs0gjfruc.png" alt="correlations" width="880" height="672"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Another interesting feature is that the report will automatically detect and drop features that it interprets as being negligibly influential.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IYnrzFL7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/ly2sauomppjk8ou6mxsi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IYnrzFL7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/ly2sauomppjk8ou6mxsi.png" alt="Warning" width="880" height="239"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This may or may not be reliable or even desirable depending on the nature of your data, but it can be useful in helping to identify features that are more or less redundant or contain no information.&lt;/p&gt;

&lt;p&gt;Each of these features has a number of more advanced parameters that you can alter when you call pandas-profiling to generate the report, and within the report itself many features are interactive. For instance, in the above correlation matrix example you can see that various measures of correlation can be viewed in individual tabs. &lt;/p&gt;

&lt;p&gt;For a more comprehensive look at the various parameters, as well as the source code behind the outputs found in the report, you can check out the &lt;a href="https://github.com/pandas-profiling/pandas-profiling"&gt;GitHub&lt;/a&gt; for Pandas Profiling.&lt;/p&gt;

&lt;p&gt;This is, in the end, merely a tool of convenience meant to help the user quickly and effectively assess data at a glance. It is in no way a replacement for further probing and analysis, but can be helpful in saving time and allowing for a faster, albeit somewhat shallow understanding of the nature of a data set.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://htmlpreview.github.io/?https://github.com/laueternal/Blog-Posts/blob/master/output.html"&gt;Here is the full report sampled in this blog&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/z01AlvAzhV7ag/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/z01AlvAzhV7ag/giphy.gif" alt="PPAP" width="360" height="202"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>python</category>
      <category>pandas</category>
      <category>eventdriven</category>
    </item>
    <item>
      <title>How I Chose Data Science</title>
      <dc:creator>Brandon Lau</dc:creator>
      <pubDate>Fri, 06 Sep 2019 18:50:33 +0000</pubDate>
      <link>https://dev.to/blau/how-i-chose-data-science-8hm</link>
      <guid>https://dev.to/blau/how-i-chose-data-science-8hm</guid>
      <description>&lt;p&gt;I recently began a 15 week program with the Flatiron School with the goal of pursuing a career in data science. With an extremely limited background in programming or anything even tangential this stands to be a long and mind boggling road. So why choose something so seemingly remote from the rest of my experience? Let's take a walk and see... or keep sitting where you are, or don't bother, whatever it's a free country, no judgement.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/5yzvlzQFqePni/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/5yzvlzQFqePni/giphy.gif" alt="murica"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By financial investment and official certification I am a geologist. Once naively entrenched in the dream of becoming a paleontologist, I was all in on being a scientific rock-star. However, toward the end of my undergraduate years I realized academia wasn't for me, and on the steps of reality my childhood dream perished like...&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/igp7Q6iW1bXBC/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/igp7Q6iW1bXBC/giphy.gif" alt="dead dreams"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I became momentarily lost, unsure of where to go after so many years of a singular goal. But weep not for me my friends, life went on and I soon found new dreams and fresh pursuits. I spent the two years after graduating as a missionary abroad, seeking to bring the opportunity of meaning to others while seeking it myself. A year of this brief whisper in the dark we call life was spent in the oil fields of Texas, followed by four years in the wild world of secondary education.&lt;/p&gt;

&lt;p&gt;So if you've made it this far you may be asking, "Brandon, what in the name of green bean soup does any of that have to do with data science?" Well my friend, I.... wait a minute, what &lt;em&gt;does&lt;/em&gt; any of this have to do with data science?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/pPhyAv5t9V8djyRFJH/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/pPhyAv5t9V8djyRFJH/giphy.gif" alt="confusion"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Oh, right! So as enthralling as this synopsis of my life has been, how did any of that point me in the direction of data? Once I decided to move away from teaching I began to look back on my varied career path to see if there was anything behind me that could help point the way forward. Upon reflection I began to see that through all the different roles I had been in there was one constant: a move toward data driven decision making. As a fairly analytical person with a penchant for puzzles and a proclivity for patterns, coupled with a few suggestions from acquaintances as well as my own explorations, data science seemed like a logical conclusion. The idea of extrapolating and extracting patterns and meaning from apparent nonsense is a fascinating prospect, and I look forward to seeing the full potential of the field. From organizing cladistics in evolution, to assessing the producibility of a reservoir, to refining methodology and content focus in education, the possibilities are endless.&lt;/p&gt;

&lt;p&gt;Anyway, that's all I have to say about that. I guess. Bye.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
