There are two types of data we use for analytics: organic (or process) data and designed data.
- ORGANIC DATA (also called PROCESS DATA): Data that is generated organically over time as a by-product of some process. Examples include financial or stock market exchange data, Netflix viewing data that the Netflix algorithm collects to improve our recommendations, and data collected by web trackers to personalize our ads. Sources like these produce enormous amounts of data, which gave rise to the term BIG DATA. Processing such massive datasets requires significant computing resources, and we as data scientists mine this data to uncover relationships among the variables of the dataset.
- DESIGNED DATA: Data whose collection is designed by professionals to address a specific research objective. Examples include individuals sampled from a population who are interviewed about their opinions on a particular topic, or a random sample of tweets collected on a particular topic to analyze people's views on that topic. Designed data is typically collected from a small subset of a large population through the administration of carefully designed questions, so the dataset is much smaller than the overall population. Unlike organic data, designed data is collected for a specific reason.
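As a rough illustration of designed data collection, the sketch below draws a simple random sample of people to interview from a larger population. The population of 10,000 ID numbers, the sample size of 200, and the helper name `simple_random_sample` are all assumptions made up for this example; it uses only Python's standard library.

```python
import random


def simple_random_sample(population, sample_size, seed=0):
    """Draw a simple random sample without replacement (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    return rng.sample(population, sample_size)


# Assumed population: ID numbers for 10,000 people.
population_ids = list(range(10_000))

# Select a small subset of the population to interview.
respondents = simple_random_sample(population_ids, sample_size=200)

print(len(respondents), "people selected out of", len(population_ids))
```

Every person has the same chance of being selected here, which is one simple way a designed sample can stay small while still representing the population.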
For analysis we want a dataset that does not systematically distort our conclusions, and a common assumption that supports this is that the data is i.i.d. (independent and identically distributed). This means that each observation is independent of the other observations and that all observations come from a common statistical distribution. For example, the exam scores of students can often be treated as independent observations drawn from a common distribution, frequently modeled as approximately normal. Hence, when we analyze data, we often assume that it is i.i.d., and based on this assumption we infer quantities such as the mean or standard deviation of the underlying distribution.
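The i.i.d. assumption can be sketched with simulated exam scores. The snippet below is a minimal illustration, not a real dataset: the normal distribution with mean 70 and standard deviation 10, and the sample size of 500, are assumptions chosen for the example.

```python
import math
import random
import statistics

rng = random.Random(42)  # fixed seed so the example is repeatable

# Assumed model: each score is an independent draw from the same
# Normal(mean=70, sd=10) distribution, i.e. the scores are i.i.d.
scores = [rng.gauss(70, 10) for _ in range(500)]

# Under the i.i.d. assumption, sample statistics estimate the
# parameters of the common underlying distribution.
sample_mean = statistics.mean(scores)
sample_sd = statistics.stdev(scores)

# Standard error of the mean: how much the sample mean would vary
# across repeated i.i.d. samples of this size.
standard_error = sample_sd / math.sqrt(len(scores))

print(f"estimated mean: {sample_mean:.2f}")
print(f"estimated standard deviation: {sample_sd:.2f}")
print(f"standard error of the mean: {standard_error:.2f}")
```

If the observations were not independent, or did not share a distribution, the standard error formula used here would no longer be justified.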