DATA SCIENCE – 1
Can you cite a real-life example where we might have to use the median value instead of the mean?
In most real-life situations, we use the mean (the average value) to predict an unknown quantity. For example, in a roll of a fair six-sided die each face appears with probability 1/6, so the expected (mean) value of a roll is 3.5. However, relying on the mean may not be advisable in certain situations.
Let us consider campus recruitment for an outgoing batch in an engineering college/university. Each student may get only a limited number of chances (often fixed by the institute) to appear for the selection process of potential employers. Let us say that number is 3 companies. Also, the institute normally bars an ‘already selected candidate’ from appearing for the other lined-up companies. Suppose a student longs to work in healthcare informatics and analytics. It is quite possible that the best companies to work for in this area come to campus towards the end of the recruitment season. Since no job is guaranteed, how should the candidate devise a strategy: which companies to sit for and which ones to avoid? Suppose it is observed that the pay packets on offer vary greatly. A few very high (or very low) offers pull the mean away from what a typical company pays, whereas the median, being the middle value, is not affected by such extremes. The best choice for the candidate would be to avoid the highest- and lowest-paying companies and appear for companies that offer a mid-range salary. In such situations, the median is a better measure than the mean.
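To make the point concrete, here is a minimal Python sketch comparing the two measures. The salary figures (in lakhs per annum) are hypothetical and chosen only to mimic a batch of offers with one very high outlier; they are not real recruitment data.

```python
from statistics import mean, median

# Hypothetical salary offers (lakhs per annum) from companies visiting campus.
# One unusually high offer is included to mimic the large variance in pay.
offers = [3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 40.0]

mean_offer = mean(offers)      # pulled upward by the single 40.0 offer
median_offer = median(offers)  # the middle value of the sorted offers

print(f"Mean offer:   {mean_offer:.2f}")
print(f"Median offer: {median_offer:.2f}")
```

With these assumed figures the mean comes to about 9.79, which is higher than six of the seven offers, while the median of 5.0 sits in the mid-range that the candidate can realistically target.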
Does a strong association always reveal an interesting causal relationship?
In Market Basket Analysis, we perform association rule mining to find items that sell together. Associations that cross the threshold values of ‘support’ and ‘confidence’ usually uncover interesting relationships among items purchased together by a customer. For instance, the sale of computers leading to the sale of security software in significant proportions can be a causal relationship.
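As an illustration of the two thresholds, the sketch below computes support and confidence for the rule "computer implies security software" in plain Python. The transactions and the helper names `support` and `confidence` are hypothetical, written for this example only; a real analysis would normally use a dedicated association rule mining library.

```python
# Hypothetical shopping baskets; each set is one customer's transaction.
transactions = [
    {"computer", "security software"},
    {"computer", "security software", "printer"},
    {"computer"},
    {"printer"},
    {"computer", "security software"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets containing `antecedent`, the fraction also containing `consequent`."""
    n_antecedent = sum(1 for basket in baskets if antecedent <= basket)
    n_both = sum(1 for basket in baskets if (antecedent | consequent) <= basket)
    return n_both / n_antecedent if n_antecedent else 0.0

rule_support = support({"computer", "security software"}, transactions)
rule_confidence = confidence({"computer"}, {"security software"}, transactions)

print(f"support    = {rule_support:.2f}")
print(f"confidence = {rule_confidence:.2f}")
```

Here 3 of the 5 baskets contain both items (support 0.60), and 3 of the 4 baskets with a computer also contain security software (confidence 0.75). Whether the rule counts as interesting depends on the thresholds the analyst chooses.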
At times, even though we discover a strong association, a causal link cannot be claimed with confidence. Data from a particular city may indicate that both ‘sale of umbrellas’ and ‘number of deaths’ have shown an upward trend in recent months. But can we see a causal relationship in this case? Note that the ‘sale of umbrellas’ may have risen owing to the rainy season, while the surge in deaths may be a consequence of floods or other contributory factors. Both are driven by a common third factor, so the answer is no: a strong association does not by itself imply causation.
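The umbrella example can be sketched numerically. In the Python snippet below, all monthly figures are invented for illustration, and `pearson` is a small helper written here rather than a library function. Umbrella sales and deaths are both made to rise with rainfall, and the two series end up strongly correlated with each other even though neither causes the other.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical monthly data for one city (not real figures).
rainfall_mm    = [10, 20, 50, 120, 300, 450]     # the common driver
umbrella_sales = [120, 150, 210, 340, 700, 1010]
deaths         = [31, 33, 34, 45, 58, 77]

r_umbrella_deaths = pearson(umbrella_sales, deaths)
r_rain_umbrella = pearson(rainfall_mm, umbrella_sales)
r_rain_deaths = pearson(rainfall_mm, deaths)

print(f"umbrella sales vs deaths:   {r_umbrella_deaths:.3f}")
print(f"rainfall vs umbrella sales: {r_rain_umbrella:.3f}")
print(f"rainfall vs deaths:         {r_rain_deaths:.3f}")
```

All three correlations are high for this made-up data. The figures alone cannot tell us which relationship, if any, is causal; that judgement comes from knowing that the rainy season plausibly drives both umbrella sales and flood-related deaths.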
Should the sample sizes necessarily be equal when we compare the potential of two regions for starting a business (to pick the one with better prospects)?
Let us consider a multinational keen on setting up a tissue paper business and weighing the prospects of an Indian metro city vis-à-vis a European city. In a European city, the use of tissue paper is a widespread social habit, whereas in an Indian metro it may be restricted to the upper segment of society. Therefore the sample sizes need not be equal. Owing to the uniformity of usage, a smaller sample can suffice for the European city, but owing to the great diversity of an Indian metro, the sample there should be much larger to estimate the true potential of the business across the diverse customer segments.
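One common way to see why a more varied population needs a larger sample is the textbook formula for estimating a proportion, n = z^2 * p * (1 - p) / e^2, which the Python sketch below applies. The usage proportions (0.9 for the uniform city, 0.5 for the diverse one), the 5% margin of error and the 95% confidence level are assumptions made purely for illustration; they are not survey results, and the lecture itself did not prescribe this formula.

```python
import math

def sample_size_for_proportion(p, margin=0.05, z=1.96):
    """Sample size needed to estimate a proportion `p` within +/- `margin`.

    Uses n = z^2 * p * (1 - p) / margin^2 for a large population,
    where z = 1.96 corresponds to roughly 95% confidence.
    """
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

# Assumed proportions of tissue paper users (hypothetical values):
# near-universal usage means little variation; p = 0.5 is the most varied case.
n_uniform_city = sample_size_for_proportion(0.9)
n_diverse_city = sample_size_for_proportion(0.5)

print(f"Uniform usage (p = 0.9): sample of {n_uniform_city}")
print(f"Diverse usage (p = 0.5): sample of {n_diverse_city}")
```

Under these assumptions the uniform market needs 139 respondents and the diverse one 385, because the variance term p * (1 - p) is largest when the population is evenly split. A diverse city might in practice also call for sampling each customer segment separately, which would raise the total further.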
These questions were part of, or inspired by, a lecture by Mr. Gautam Bannerjee of Computer Brio and his talk on ANALYTICS WITH R, delivered at Galgotias University, 9th-13th Dec, 2017.