You’ve probably heard the word “data” many times: in school, on the news, in your daily work or profession, or somewhere on the Internet. And if you are a data scientist, well, your entire profession depends on it.
Data is limitless and it’s present anywhere in the universe, yet the term “data” can sometimes be confusing because nearly everyone has their own idea of what it means.
[My data is not your data 😃]
In computing, data may take the form of text, documents, images, audio, and video. At its most rudimentary level, data is a bunch of ones and zeros.
In statistics, data is defined as facts or figures from which conclusions can be drawn.
IT professionals will describe data in terms of entities and attributes.
In layman’s terms, data describes a person, place, object, event, or concept in the user’s context or environment, with its meaning dependent on how it is organized.
In computing, different arrangements of 1s and 0s mean different things:
[0001 = 1 and 0010 = 2].
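To make the ones-and-zeros example concrete, here is a minimal Python sketch. The helper name `bits_to_int` is made up for this illustration; it simply wraps Python's built-in `int()` with base 2.

```python
def bits_to_int(bits: str) -> int:
    """Read a string of 1s and 0s as a base-2 (binary) number."""
    return int(bits, 2)


# Same digits, different arrangement, different meaning.
for bits in ("0001", "0010"):
    print(bits, "=", bits_to_int(bits))
```

Moving the single 1 one position to the left changes the value, which is the whole point: the arrangement carries the meaning.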
In biology, different sequences of the bases in a genome (A, C, G, and T) result in different genetic codes, which represent different individuals or species.
Listing the purchase history with the identity of a customer represents the purchasing habit of that particular individual.
Your tweets could be a random arrangement of the 26 letters of the English alphabet and spaces. Yet you chose to arrange them in a way that conveys meaning.
If data is not put into context it’s of no value to humans or computers. Context is key.
In the context of computing, 0001 is the binary representation of 1.
In the context of Italian, your tweet in English means nothing, even though an Italian sentence might be built from the very same characters.
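Here is a small Python sketch of "context is key". The two bytes below were picked only for this illustration; the same bits are read once as text and once as a number.

```python
# Two bytes chosen for this illustration: 01001000 01101001
raw = bytes([0b01001000, 0b01101001])

# Context 1: treat the bytes as ASCII text.
as_text = raw.decode("ascii")

# Context 2: treat the very same bytes as one big-endian integer.
as_number = int.from_bytes(raw, "big")

print(as_text)    # Hi
print(as_number)  # 18537
```

Nothing about the bits changed between the two lines. Only the context we read them in did.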
Ideally, data are facts. Some say that “facts” are things that can be shown to be true, to exist, or to have happened.
Well, I say “ideally” because data are not always factual. Simply put, data can be wrong. Part or all of the data can sometimes represent something entirely different from what you expect or intend to measure, e.g. “Schoolboy finds a flaw in NASA data” and “Math Error To Cost Maryland $31 Million”.
Data that is factual, true, or serves the needs of the problem domain is sometimes referred to as good data or signal.
Data that is false, invalid, or does not serve the needs of the problem domain is sometimes called bad data or noise.
Data that describes a set of data (more than one piece of data) is called metadata, and a set of data is often referred to as a dataset.
Let's consider a scenario (a circumstance or a particular experiment) where you want to learn about the kinds of passengers who board the same bus/train as you at your local bus/train station. So you gather some information about each individual, which becomes your dataset. [stalker 😏]
Datasets are typically displayed in tables, as shown below.
A dataset is a set of data identified with a particular experiment, scenario, subject, or circumstance.
In the table, rows represent individuals and columns represent variables.
From the above we can say that:
Data are pieces of information about individuals, organized into variables.
By an individual, we mean a particular person or object.
In our scenario, the passengers are the individuals.
Individuals are sometimes called observations, cases, vectors, or feature vectors.
By a variable, we mean a particular characteristic of the individual. In our scenario, the variables are Age, Height, Seat Number, Gender, Class.
Variables are sometimes called attributes, fields, or features.
Each row gives us all of the information about a particular individual (in this case a passenger), and each column gives us information about a particular characteristic of all of the passengers.
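The rows-and-columns idea can be sketched in plain Python. Every value below is hypothetical, invented only to show the layout (heights are assumed to be in centimetres), and the helper names `get_row` and `get_column` are made up for this example.

```python
# The variables (columns) from our scenario.
variables = ["Age", "Height", "Seat Number", "Gender", "Class"]

# Hypothetical passengers: each dict is one individual (one row).
passengers = [
    {"Age": 34, "Height": 172, "Seat Number": 12, "Gender": "F", "Class": "Economy"},
    {"Age": 51, "Height": 180, "Seat Number": 3, "Gender": "M", "Class": "First"},
    {"Age": 27, "Height": 165, "Seat Number": 27, "Gender": "F", "Class": "Economy"},
]


def get_row(dataset, index):
    """Everything we know about one individual."""
    return dataset[index]


def get_column(dataset, variable):
    """One characteristic across all individuals."""
    return [individual[variable] for individual in dataset]


first_passenger = get_row(passengers, 0)
ages = get_column(passengers, "Age")

print(first_passenger)
print(ages)
```

A list of dictionaries keeps the sketch free of extra libraries. In practice you would more likely load a dataset like this into a table-oriented tool, but the row and column idea stays the same.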
Data can be classified in many ways and from different perspectives, which deserves its own blog post. In short, data can be classified as raw or processed, structured or unstructured, and qualitative or quantitative.
If you followed carefully, you will have realized that there are different ways of naming the same thing, which stem from the field of study, preference, or mere convention. This can be overwhelming for a beginner or someone new to a particular field, but don’t be discouraged. You might already know what a term means. It's all a matter of familiarity. Don’t feel bad about asking or searching the Internet.