DEV Community: Koo Ping Shung

Becoming a Data Scientist - Part 1

Koo Ping Shung — Wed, 08 Apr 2020 02:00:03 +0000

A question I get most often at meetups, seminars and training sessions is, “Given my XXX background (the common ones being Computer Science, Statistics, Engineering and Economics), how do I get started in Data Science? How do I build up my skills and knowledge so I can embark on Data Science as a career?”

So I decided to write several posts here to help individuals keep tabs on their inventory of Data Science skills and knowledge.

From the macro view, I usually show the following Venn diagram to help people understand the skills and knowledge that are needed.

There are a lot of Venn diagrams out there that describe what Data Science is; here is a list of them.

The thought behind my Venn diagram is to help people understand the skills and knowledge that are needed and to guide them on becoming a data scientist. I wanted to be as precise as possible so that readers can be more focused in their learning journey. Thus you may find it “cleaner” than other Venn diagrams that you have seen.

Venn Diagram

There are three components of the Venn diagram:

1- Data & IT Management

2- Mathematical Models

3- Domain Expertise

Data & IT Management

As data scientists, we have to advise on a few areas of the IT and data infrastructure, such as how to handle missing values, whether data can be captured at a more granular level, how to improve data quality and how to implement a scorecard in existing systems. With a good understanding of the Data & IT infrastructure, we can then propose constructive suggestions on managing data and using the models that we have built. Through practical suggestions, data science can continue to add value and flourish in an organization.
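To make the data quality point a little more concrete, here is a minimal sketch of one way to profile missing values before advising on how to handle them. The records, field names and the `missing_rate` helper are all made up for illustration; a real project would more likely run a check like this against the actual database or a data frame.

```python
from collections import Counter

# Hypothetical customer records; None marks a missing value.
records = [
    {"age": 34, "income": 5200, "postal_code": "560123"},
    {"age": None, "income": 4100, "postal_code": "310045"},
    {"age": 51, "income": None, "postal_code": None},
    {"age": 29, "income": None, "postal_code": "120301"},
]


def missing_rate(rows):
    """Return the fraction of rows in which each field is missing (None or absent)."""
    fields = sorted({key for row in rows for key in row})
    missing = Counter()
    for row in rows:
        for field in fields:
            if row.get(field) is None:
                missing[field] += 1
    return {field: missing[field] / len(rows) for field in fields}


rates = missing_rate(records)
for field, rate in rates.items():
    print(f"{field}: {rate:.0%} missing")
```

A summary like this is often the starting point for the conversation with the IT team: a field that is missing half the time may need a change in how it is captured, not just a clever imputation.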

Mathematical Models

Mathematical models need no explanation. It is essential for data scientists to know and understand them. I would like to point out that there is a need to consider computational complexity; it is not a one-way street to “highest accuracy” ville.

Just in case you are wondering, statistics is included as well.

Domain Expertise

So what about domain expertise? Previously I labelled the circle "Business Expertise", but as my experience accumulated, I noticed that NGOs and charities are beginning to tap into their existing data to make donations or causes go further. Thus I decided to change it to “domain expertise” to correctly reflect the current environment with regard to data science.

Generally, when we decide to build any model, data scientists should think about stakeholders’ reactions to it. For instance, if we build a model that segments students and then provides resources to the students that are likely to succeed, this would create an uproar among students, especially those classified as “poor”. Thus we would like to structure the business/organization objectives and models in a way that really meets the objectives without bringing “damage” to other aspects of the business. That requires good knowledge of how the business works, for instance understanding its business model, processes and operations, regulations, etc.

Another example: if we are required to build a recommender system, accuracy should never be the sole consideration in selecting the best model for the task. As data scientists, we also have to determine the computational complexity of the chosen model. Here is a real-life example from Netflix.
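The trade-off between accuracy and computational cost can be sketched in a few lines. This is only an illustration under stated assumptions, not how Netflix or anyone else actually selects models: the candidate names, accuracy figures, latencies and the `pick_model` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    accuracy: float    # offline accuracy (made-up number)
    latency_ms: float  # time to serve one recommendation (made-up number)


def pick_model(candidates, latency_budget_ms):
    """Pick the most accurate candidate that still fits the latency budget."""
    feasible = [c for c in candidates if c.latency_ms <= latency_budget_ms]
    if not feasible:
        return None  # nothing is cheap enough; the budget or the models must change
    return max(feasible, key=lambda c: c.accuracy)


candidates = [
    Candidate("large_ensemble", accuracy=0.92, latency_ms=450.0),
    Candidate("matrix_factorization", accuracy=0.89, latency_ms=40.0),
    Candidate("popularity_baseline", accuracy=0.75, latency_ms=2.0),
]

chosen = pick_model(candidates, latency_budget_ms=100.0)
print(chosen.name)
```

With a 100 ms budget the most accurate model is ruled out, and the slightly less accurate but far cheaper one is chosen. Where the budget comes from (infrastructure cost, user experience, engineering effort) is exactly where domain expertise comes in.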

Conclusion

A good data scientist never stops learning. Why is that so? If you look at the three areas that data scientists need skills and knowledge in, they are changing every day. In 2017 and 2018, Hadoop and Spark were mentioned a lot, and they were seen as essential skills for a data engineer or data scientist. Fast forward to 2020, who is talking about them? The infrastructure we are talking about these days is cloud computing.

In the early part of the last decade, most people knew neural networks as having a single computation layer, until AlphaGo came into the picture and everyone saw a lot of breakthroughs on the Artificial Intelligence front because of Deep Learning, a derivative of neural networks. This example shows that new machine learning algorithms are being invented, and thus the job of a data scientist is to learn and understand them. By the way, we have not touched on quantum computing yet. That will be a whole new paradigm.

I wish you all the best in your Data Science journey. Consider signing up for my newsletter to stay in touch. Feedback is welcome; link up with me on my LinkedIn profile and Twitter! :)

Ethics in Artificial Intelligence: Let's Start Now!

Koo Ping Shung — Fri, 06 Mar 2020 07:38:42 +0000

Last July, I had the privilege of attending a panel discussion on Ethics in Artificial Intelligence. It was an interesting discussion, and I felt that such discussions can be pretty abstract for non-technical folks.

There are many cases where AI has been misused intentionally (such as DeepNude) and unintentionally (such as biased selection of job applicants). At the end of the day, we need to realize that AI is a tool, and whether it is used for harm or for the greater good really depends on the individual. As Uncle Ben once said, “With great powers (coding, machine learning and artificial intelligence), comes great responsibilities”.

Here are my thoughts after the panel discussion.

First, many governments have prepared guidelines on using Artificial Intelligence by now. These guidelines will definitely form good “reference” material for any company or individual who wants to understand more about the ethical usage of Artificial Intelligence. Organizations looking to prepare their own guidelines will need to start arming themselves with technical knowledge of AI: how it works and may work, what the near-term possibilities are and what the “impossibles” are. But guidelines, at the end of the day, can only reduce the “unintentional” misuse. To reduce the “intentional”, society needs a strong regulatory and legal framework, which requires regulators to have a thorough understanding of how AI works; otherwise we may be punishing the wrong actors.

Second, if one has observed history, during the Enron and Tyco sagas (early 2000s), where accounting processes were abused, there were discussions on ethics among business managers. The accounting process was created to help management and investors understand in detail where the business is heading. It was never meant to paint a false picture.

Moving on to 2008, during the Financial Crisis, there were again discussions on ethics, this time round about the bankers. Complex investment products were pushed and sold to the public, business organizations and pension funds, resulting in huge losses. Retail investors were hit the hardest and are now having to rebuild their retirement nest eggs.

In this third rise of AI, we are now moving on to say that there needs to be ethics in the ML and AI community (the engineers and scientists). This time round, greed is not the cause but rather ignorance, which is not difficult to overcome. We just need more education and critical thinking to avoid negative usage of AI.

I believe there are two things that we need to do right NOW:

1) Educate the public on what Artificial Intelligence and Data Science are about, and on what kind of information we (data and AI scientists) can get from their data. As individuals, we need to start realizing that our data has value and that it can be abused if not protected well.

If you are in Singapore, a good start is to go for “AI 4 Everyone” conducted by AISingapore. Check out their “Events” page to find more details.

If you are not in Singapore, start looking out for tech communities, listen to their talks to understand what they do with data, and ask questions if needed. If all else fails, Google is your best teacher. :)

2) We, the AI and DS community, need to start realizing that our work can have an impact on our customers and the general public. We have to start asking ourselves whether the impact that we are creating, although it may have helped the business overall, is also "harmful" to other stakeholders, namely the customers.

There are many places where things can go ‘wrong’, for instance biased data, models that are extremely difficult to interpret (like deep learning, random forests and support vector machines), and how decisions are made in high-stakes situations such as healthcare and education.

The tech community should start putting itself in the shoes of the customers and empathize with how they feel at the “receiving end” of the models that we build.
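One simple way to start looking at the “receiving end” is to compare how often a model's decisions favour different groups. The sketch below is only an illustration: the decisions, the group labels and the `selection_rates` helper are hypothetical, and a gap in selection rates is a prompt for further questions rather than proof of bias.

```python
from collections import defaultdict

# Hypothetical screening decisions: (group label, was the applicant shortlisted?)
decisions = [
    ("group_a", True), ("group_a", True), ("group_a", False), ("group_a", True),
    ("group_b", True), ("group_b", False), ("group_b", False), ("group_b", False),
]


def selection_rates(rows):
    """Return the fraction of applicants shortlisted within each group."""
    totals = defaultdict(int)
    selected = defaultdict(int)
    for group, shortlisted in rows:
        totals[group] += 1
        if shortlisted:
            selected[group] += 1
    return {group: selected[group] / totals[group] for group in totals}


rates = selection_rates(decisions)
gap = max(rates.values()) - min(rates.values())

for group, rate in rates.items():
    print(f"{group}: {rate:.0%} shortlisted")
print(f"gap between groups: {gap:.0%}")
```

A large gap does not say why the groups are treated differently, but it tells us where to start asking, which is the kind of awareness this post is arguing for.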

Conclusion

At the end of the day, I do not think ethics applies just to the tech community (as it seems at the moment) but to everybody: we should all start thinking more about the impact of our actions, both personal and at work. The global conversation on ethics in artificial intelligence should definitely start now to raise awareness, both among the public and in the tech community, and it will definitely take time. Another important point to note is that these ethics discussions will never have a conclusion. But reaching a conclusion is not what we want to achieve; rather, we want the awareness, knowledge and empathy to consider matters from many angles and views.

I hope the post has been useful to you. Feedback and discussions are welcome, as long as we take the approach of NOT CONVINCING each other but of considering different viewpoints. :)