1. What are the different types of Sampling?
Ans: Some of the Common sampling ways are as follows:
- Simple random sample: Every member and every set of members has an equal chance of being included in the sample. Technology, random number generators, or some other sort of chance process is needed to get a simple random sample.
Example—A teacher puts students’ names in a hat and chooses without looking to get a sample of students.
Why it’s good: Random samples are usually fairly representative since they don’t favor certain members.
- Stratified random sample: The population is first split into groups. The overall sample consists of some members of every group. The members of each group are chosen randomly.
Example: A student council surveys 100 students by getting random samples of 25 freshmen, 25 sophomores, 25 juniors, and 25 seniors.
Why it’s good: A stratified sample guarantees that members from each group will be represented in the sample, so this sampling method is good when we want some members from every group.
- Cluster random sample: The population is first split into groups. The groups are selected at random, and the overall sample consists of every member of the selected groups.
Example: An airline company wants to survey its customers one day, so they randomly select 5 flights that day and survey every passenger on those flights.
Why it’s good: A cluster sample gets every member from some of the groups, so it’s good when each group reflects the population as a whole.
- Systematic random sample: Members of the population are put in some order. A starting point is selected at random, and every nth member is selected to be in the sample.
Example—A principal takes an alphabetized list of student names and picks a random starting point. Every 20th student is selected to take a survey.
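The four sampling methods can be compared side by side in a short Python sketch. The population of 400 students, the class-year labels, and the 20 homerooms used as clusters are all hypothetical, chosen only to mirror the examples above.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 400 students, 100 per class year.
years = ["freshman", "sophomore", "junior", "senior"]
population = [(f"student_{i}", years[i % 4]) for i in range(400)]

# Simple random sample: every subset of 100 students is equally likely.
simple = rng.sample(population, 100)

# Stratified random sample: 25 chosen at random from each class year.
stratified = []
for year in years:
    group = [p for p in population if p[1] == year]
    stratified.extend(rng.sample(group, 25))

# Cluster random sample: pick whole groups at random, keep every member.
# The clusters here are 20 hypothetical homerooms of 20 students each.
homerooms = [population[i:i + 20] for i in range(0, 400, 20)]
cluster = [p for room in rng.sample(homerooms, 5) for p in room]

# Systematic random sample: random starting point, then every 20th member.
start = rng.randrange(20)
systematic = population[start::20]

print(len(simple), len(stratified), len(cluster), len(systematic))
# prints: 100 100 100 20
```

Only the stratified sample guarantees exactly 25 students from each year; the other three can over- or under-represent a year by chance.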
2. What is the confidence interval? What is its significance?
Ans: A confidence interval, in statistics, is a range of values, computed from sample data, that is expected to contain a population parameter for a certain proportion of times if the sampling were repeated. Confidence intervals measure the degree of uncertainty or certainty in a sampling method. Any confidence level can be chosen, with the most common being 95% or 99%.
3. What are the effects of the width of the confidence interval?
- The confidence interval is used for decision making
- As the confidence level increases, the width of the confidence interval also increases
- As the width of the confidence interval increases, we tend to get useless information also.
- Useless information – wide CI
- High risk – narrow CI
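A small Python sketch can make the trade-off concrete. It assumes a z-based interval for a mean, and the standard deviation (10) and sample size (50) are hypothetical values picked for illustration.

```python
from statistics import NormalDist

def ci_width(confidence, sd, n):
    """Width of a z-based confidence interval for a mean."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return 2 * z * sd / n ** 0.5

# Hypothetical inputs: standard deviation 10, sample size 50.
for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence -> width {ci_width(level, 10, 50):.2f}")
```

The printed widths grow with the confidence level: being more certain that the interval covers the parameter costs precision.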
4. What is the level of significance (Alpha)?
Ans: The significance level, also denoted as alpha or α, is a measure of the strength of the evidence that must be present in your sample before you will reject the null hypothesis and conclude that the effect is statistically significant. The researcher determines the significance level before conducting the experiment.
The significance level is the probability of rejecting the null hypothesis when it is true. For example, a significance level of 0.05 indicates a 5% risk of concluding that a difference exists when there is no actual difference. Lower significance levels indicate that you require stronger evidence before you will reject the null hypothesis.
Use significance levels during hypothesis testing to help you determine which hypothesis the data support. Compare your p-value to your significance level. If the p-value is less than your significance level, you can reject the null hypothesis and conclude that the effect is statistically significant. In other words, the evidence in your sample is strong enough to be able to reject the null hypothesis at the population level.
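As a rough illustration of comparing a p-value to alpha, here is a minimal sketch of a two-sided one-sample z-test. It assumes the population standard deviation is known, and every number in it is hypothetical.

```python
from statistics import NormalDist

def two_sided_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Return (z, p_value, reject_null) for a two-sided one-sample z-test."""
    z = (sample_mean - mu0) / (sigma / n ** 0.5)
    # Probability of a result at least this extreme if the null is true.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < alpha

# Hypothetical numbers: sample mean 52, null mean 50, sigma 8, n = 64.
z, p, reject = two_sided_z_test(52, 50, 8, 64)
print(f"z = {z:.2f}, p = {p:.4f}, reject H0 at alpha 0.05: {reject}")
# prints: z = 2.00, p = 0.0455, reject H0 at alpha 0.05: True
```

With the stricter alpha of 0.01 the same data would not reject the null hypothesis, which shows why the level has to be fixed before the experiment.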
5. What are Skewness and Kurtosis? What does it signify?
Ans: Skewness: It is the degree of distortion from the symmetrical bell curve or the normal distribution. It measures the lack of symmetry in the data distribution. It differentiates extreme values in one versus the other tail. A symmetrical distribution will have a skewness of 0.
There are two types of Skewness: Positive and Negative
Positive Skewness means when the tail on the right side of the distribution is longer or fatter. The mean and median will be greater than the mode.
Negative Skewness is when the tail of the left side of the distribution is longer or fatter than the tail on the right side. The mean and median will be less than the mode.
So, when is the skewness too much?
The rule of thumb seems to be:
- If the skewness is between -0.5 and 0.5, the data are fairly symmetrical.
- If the skewness is between -1 and -0.5(negatively skewed) or between 0.5 and 1(positively skewed), the data are moderately skewed.
- If the skewness is less than -1(negatively skewed) or greater than 1(positively skewed), the data are highly skewed.
Let us take a very common example of house prices. Suppose we have house values ranging from $100k to $1,000,000 with the average being $500,000.
If the peak of the distribution were left of the average value, it would portray a positive skewness in the distribution. It would mean that many houses were being sold for less than the average value, i.e. $500k. This could be for many reasons, but we are not going to interpret those reasons here.
If the peak of the distributed data was right of the average value, that would mean a negative skew. This would mean that the houses were being sold for more than the average value.
Kurtosis: Kurtosis is all about the tails of the distribution, not the peakedness or flatness. It describes how extreme the values in the tails are. It is actually the measure of outliers present in the distribution.
High kurtosis in a data set is an indicator that the data has heavy tails or outliers. If there is high kurtosis, we need to investigate why we have so many outliers. It can indicate many things, such as wrong data entry. Investigate!
Low kurtosis in a data set is an indicator that the data has light tails or a lack of outliers. If we get low kurtosis (too good to be true), then we also need to investigate and trim the dataset of unwanted results.
Mesokurtic: This distribution has kurtosis statistics similar to that of the normal distribution. It means that the extreme values of the distribution are similar to that of a normal distribution characteristic. This definition is used so that the standard normal distribution has a kurtosis of three.
Leptokurtic (Kurtosis > 3): Distribution is longer, tails are fatter. The peak is higher and sharper than Mesokurtic, which means that the data are heavy-tailed or have a profusion of outliers.
Outliers stretch the horizontal axis of the histogram graph, which makes the bulk of the data appear in a narrow (“skinny”) vertical range, thereby giving the “skinniness” of a leptokurtic distribution.
Platykurtic (Kurtosis < 3): Distribution is shorter; tails are thinner than the normal distribution. The peak is lower and broader than Mesokurtic, which means that the data are light-tailed or lack outliers. This is because the extreme values are less extreme than those of the normal distribution.
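Both measures can be computed from the moments of the data. The sketch below uses the plain population (moment-based) definitions, under which a normal distribution has kurtosis of about 3; statistical packages often apply sample-size corrections or report excess kurtosis (kurtosis minus 3), so their numbers may differ. The two data sets are hypothetical.

```python
def _central_moment(data, order):
    """Average of (x - mean) ** order over the data."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** order for x in data) / len(data)

def skewness(data):
    """Third standardized moment: 0 for perfectly symmetrical data."""
    return _central_moment(data, 3) / _central_moment(data, 2) ** 1.5

def kurtosis(data):
    """Fourth standardized moment: about 3 for normally distributed data."""
    return _central_moment(data, 4) / _central_moment(data, 2) ** 2

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 20]  # one large value stretches the right tail

print(f"symmetric:    skew {skewness(symmetric):.2f}, kurtosis {kurtosis(symmetric):.2f}")
print(f"right-tailed: skew {skewness(right_tailed):.2f}, kurtosis {kurtosis(right_tailed):.2f}")
```

By the rule of thumb above, the right-tailed data set counts as highly skewed because its skewness is greater than 1.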
6. What are Range and IQR? What does it signify?
Ans: Range: The range of a set of data is the difference between the highest and lowest values in the set.
IQR(Inter Quartile Range): The interquartile range (IQR) is the difference between the first quartile and the third quartile. The formula for this is:
IQR = Q3 – Q1
The range gives us a measurement of how spread out the entirety of our data set is. The interquartile range, which tells us how far apart the first and third quartiles are, indicates how spread out the middle 50% of our set of data is.
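Here is a minimal sketch of both measures using Python's standard library. The data values are hypothetical, and quartile conventions vary between tools, so the exact IQR may differ slightly from what a spreadsheet or another package reports.

```python
from statistics import quantiles

# Hypothetical values; 40 is a deliberate outlier.
data = [2, 4, 4, 5, 7, 8, 9, 11, 12, 40]

data_range = max(data) - min(data)

# quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3
# (using the library's default "exclusive" method).
q1, _, q3 = quantiles(data, n=4)
iqr = q3 - q1

print(f"range = {data_range}, IQR = {iqr}")
```

The single outlier inflates the range but barely affects the IQR, which is why the IQR is often preferred as a description of typical spread.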
7. What is the difference between Variance and Standard Deviation? What is its significance?
Ans: The mean, a measure of central tendency, gives you an idea of the average of the data points (i.e. the center location of the distribution). Next you want to know how far your data points are from the mean. Here comes the concept of variance: the average of the squared distances of your data points from the mean (in simple terms, it measures the variation of your data points around the mean).
Standard deviation is simply the square root of the variance, and it is also used to measure the variation of your data points. You may be asking why we use standard deviation when we have variance: it keeps the calculations in the same units as the data. Suppose the mean is in cm (or m); then the variance is in cm² (or m²), whereas the standard deviation is in cm (or m), so we use standard deviation most.
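A quick sketch with hypothetical heights shows the relationship and the units. It uses the population versions (`pvariance`, `pstdev`); the sample versions `variance` and `stdev` divide by n-1 instead of n.

```python
from statistics import mean, pvariance, pstdev

heights_cm = [150, 160, 170, 180, 190]  # hypothetical heights in cm

avg_cm = mean(heights_cm)        # in cm
var_cm2 = pvariance(heights_cm)  # in cm squared
sd_cm = pstdev(heights_cm)       # back in cm: square root of the variance

print(f"mean = {avg_cm:.1f} cm, variance = {var_cm2:.1f} cm^2, sd = {sd_cm:.2f} cm")
```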
8. What is selection Bias? Types of Selection Bias?
Ans: Selection bias is the phenomenon of selecting individuals, groups, or data for analysis in such a way that proper randomization is not achieved, ultimately resulting in a sample that is not representative of the population.
Understanding and identifying selection bias is important because it can significantly skew results and provide false insights about a particular population group.
Types of selection bias include:
- Sampling bias: a biased sample caused by non-random sampling
- Time interval: selecting a specific time frame that supports the desired conclusion. e.g. conducting a sales analysis near Christmas.
- Exposure: includes clinical susceptibility bias, protopathic bias, indication bias.
- Data: includes cherry-picking, suppressing evidence, and the fallacy of incomplete evidence.
- Attrition: attrition bias is similar to survivorship bias, where only those that ‘survived’ a long process are included in an analysis, or failure bias, where only those that ‘failed’ are included.
- Observer selection: related to the Anthropic principle, which is a philosophical consideration that any data we collect about the universe is filtered by the fact that, in order for it to be observable, it must be compatible with the conscious and sapient life that observes it.
Handling missing data can make selection bias worse because different methods impact the data in different ways. For example, if you replace null values with the mean of the data, you are adding bias in the sense that you’re assuming that the data is not as spread out as it might actually be.
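The effect of mean imputation on spread is easy to check numerically. In this sketch the observed values and the count of missing entries are hypothetical.

```python
from statistics import mean, pstdev

observed = [10, 12, 15, 18, 25]  # hypothetical known values
n_missing = 3                    # hypothetical number of missing entries

# Mean imputation: fill every missing entry with the observed mean.
imputed = observed + [mean(observed)] * n_missing

print(f"mean before {mean(observed):.2f}, after {mean(imputed):.2f}")
print(f"sd   before {pstdev(observed):.2f}, after {pstdev(imputed):.2f}")
```

The mean is unchanged, but the standard deviation shrinks, because every imputed point sits exactly at the center of the data.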
9. What are the ways of handling missing Data?
- Delete rows with missing data
- Mean/Median/Mode imputation
- Assigning a unique value
- Predicting the missing values using Machine Learning Models
- Using an algorithm that supports missing values, like random forests.
10. What are the different types of the probability distribution? Explain with example?
Ans: The common Probability Distribution is as follows:
- Bernoulli Distribution
- Uniform Distribution
- Binomial Distribution
- Normal Distribution
- Poisson Distribution
1. Bernoulli Distribution: A Bernoulli distribution has only two possible outcomes, namely 1 (success) and 0 (failure), and a single trial. So the random variable X which has a Bernoulli distribution can take value 1 with the probability of success, say p, and the value 0 with the probability of failure, say q or 1-p.
Example: whether it’s going to rain tomorrow or not where rain denotes success and no rain denotes failure and Winning (success) or losing (failure) the game.
2. Uniform Distribution: When you roll a fair die, the outcomes are 1 to 6. The probabilities of getting these outcomes are equally likely and that is the basis of a uniform distribution. Unlike Bernoulli Distribution, all the n number of possible outcomes of a uniform distribution are equally likely.
Example: Rolling a fair die.
3. Binomial Distribution: A distribution where only two outcomes are possible, such as success or failure, gain or loss, win or lose and where the probability of success and failure is the same for all the trials is called a Binomial Distribution.
- Each trial is independent.
- There are only two possible outcomes in a trial- either a success or a failure.
- A total number of n identical trials are conducted.
- The probability of success and failure is the same for all trials. (Trials are identical.)
Example: Tossing a coin.
4. Normal Distribution: Normal distribution represents the behavior of most of the situations in the universe (That is why it’s called a “normal” distribution. I guess!). The sum of a large number of (small) independent random variables often turns out to be approximately normally distributed, contributing to its widespread application. Any distribution is known as Normal distribution if it has the following characteristics:
- The mean, median, and mode of the distribution coincide.
- The curve of the distribution is bell-shaped and symmetrical about the line x=μ.
- The total area under the curve is 1.
- Exactly half of the values are to the left of the center and the other half to the right.
5. Poisson Distribution: A distribution is called Poisson distribution when the following assumptions are valid:
- Any successful event should not influence the outcome of another successful event.
- The probability of success over an interval is proportional to the length of the interval, so intervals of equal length have the same probability of success.
- The probability of success in an interval approaches zero as the interval becomes smaller.
Example: The number of emergency calls recorded at a hospital in a day.
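The discrete distributions above each have a short probability mass function. The following sketch writes them out directly from their textbook formulas; the function names are illustrative, and the normal distribution is left out because it is continuous (the standard library's `statistics.NormalDist` covers it).

```python
from math import comb, exp, factorial

def bernoulli_pmf(x, p):
    """P(X = x) for a single trial with success probability p (x is 0 or 1)."""
    return p if x == 1 else 1 - p

def uniform_pmf(outcomes):
    """Each of the equally likely outcomes has the same probability."""
    return 1 / outcomes

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(exactly k events) when events occur at an average rate lam."""
    return lam ** k * exp(-lam) / factorial(k)

print(f"rain tomorrow, p = 0.3:       {bernoulli_pmf(1, 0.3):.2f}")
print(f"one face of a fair die:       {uniform_pmf(6):.4f}")
print(f"5 heads in 10 fair tosses:    {binomial_pmf(5, 10, 0.5):.4f}")
print(f"2 calls when 3 are expected:  {poisson_pmf(2, 3):.4f}")
```

A Bernoulli distribution is the special case of a binomial distribution with a single trial, which the code reflects: `binomial_pmf(1, 1, p)` equals `bernoulli_pmf(1, p)`.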
11. What are the statistical Tests? List Them.
Ans: Statistical tests are used in hypothesis testing. They can be used to:
- determine whether a predictor variable has a statistically significant relationship with an outcome variable.
- estimate the difference between two or more groups.
Statistical tests assume a null hypothesis of no relationship or no difference between groups. Then they determine whether the observed data fall outside of the range of values predicted by the null hypothesis.
Common Tests in Statistics:
- Chi-Square Test
12. How do you calculate the sample size required?
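Ans: You can use the margin of error (ME) formula to determine the desired sample size. The margin of error is ME = (t or z score) × S / √n, so solving for n gives n = ((t or z score) × S / ME)², where: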
- t/z = t/z score used to calculate the confidence interval
- ME = the desired margin of error
- S = sample standard deviation
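As a sketch, assume the usual margin-of-error relation ME = z × S / √n, which rearranges to n = (z × S / ME)². The code uses a z score rather than a t score (a simplification, since the t score itself depends on n), and the inputs are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def required_sample_size(margin_of_error, sd, confidence=0.95):
    """Smallest n whose z-based margin of error is at most the target."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    return ceil((z * sd / margin_of_error) ** 2)    # round up to a whole unit

# Hypothetical inputs: standard deviation 15, desired margin of error 2.
print(required_sample_size(2, 15))  # prints: 217
```

Halving the margin of error roughly quadruples the required sample size, because n grows with the square of 1/ME.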
13. What are the different Biases associated when we sample?
Ans: Potential biases include the following:
- Sampling bias: a biased sample caused by non-random sampling
- Undercoverage bias: some members of the population are inadequately represented in the sample
- Survivorship bias: error of overlooking observations that did not make it past a form of the selection process.
14. How to convert normal distribution to standard normal distribution?
Standardized normal distribution has mean = 0 and standard deviation = 1.
To convert normal distribution to standard normal distribution we can use the formula:
X (standardized) = (x-µ) / σ
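Applied to a whole data set, the formula looks like this in Python. The values are hypothetical, and the sketch treats them as the full population, so it uses the population standard deviation.

```python
from statistics import mean, pstdev

def standardize(values):
    """Apply (x - mu) / sigma to every value."""
    mu, sigma = mean(values), pstdev(values)
    return [(x - mu) / sigma for x in values]

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data: mean 5, sd 2
z_values = standardize(values)

print(z_values)
print(f"mean = {mean(z_values):.2f}, sd = {pstdev(z_values):.2f}")
```

After the conversion the values have mean 0 and standard deviation 1, whatever units the original data used.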
15. How to find the mean length of all fishes in a river?
- Define the confidence level (most common is 95%)
- Take a sample of fishes from the river (to get better results the number of fishes > 30)
- Calculate the mean length and standard deviation of the lengths
- Calculate t-statistics
- Get the confidence interval in which the mean length of all the fishes should be.
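The steps above can be sketched as follows. The fish lengths are simulated, not real measurements. The sketch also uses a z critical value in place of the t value, which is only a reasonable approximation for samples larger than about 30; this is an assumption made because the standard library has no t-distribution.

```python
import random
from statistics import NormalDist, mean, stdev

rng = random.Random(42)
# Hypothetical sample of 40 fish lengths in cm (simulated data).
lengths = [rng.gauss(30, 5) for _ in range(40)]

def mean_confidence_interval(sample, confidence=0.95):
    """Approximate confidence interval for the population mean."""
    n = len(sample)
    x_bar, s = mean(sample), stdev(sample)
    # Large-sample assumption: z critical value stands in for the t value.
    critical = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = critical * s / n ** 0.5
    return x_bar - half_width, x_bar + half_width

low, high = mean_confidence_interval(lengths)
print(f"sample mean = {mean(lengths):.2f} cm, 95% CI = ({low:.2f}, {high:.2f}) cm")
```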
16. What do you mean by the degree of freedom?
- DF is defined as the number of options we have, i.e. the number of values that are free to vary
- DF is used with t-distribution and not with Z-distribution
- For a series, DF = n-1 (where n is the number of observations in the series)
17. What do you think if DF is more than 30?
- As DF increases the t-distribution reaches closer to the normal distribution
- At low DF, we have fat tails
- If DF > 30, then t-distribution is as good as the normal distribution.
18. When to use t distribution and when to use z distribution?
- Both of the following conditions must be satisfied to use the Z-distribution:
- Do we know the population standard deviation?
- Is the sample size > 30?
- If so, CI = x (bar) – Z*σ/√n to x (bar) + Z*σ/√n
- Else we should use the t-distribution:
- CI = x (bar) – t*s/√n to x (bar) + t*s/√n
19. What are H0 and H1? What is H0 and H1 for the two-tail test?
- H0 is known as the null hypothesis. It is the normal case/default case.
For one tail test x <= µ
For two-tail test x = µ
- H1 is known as an alternate hypothesis. It is the other case.
For one tail test x > µ
For two-tail test x ≠ µ
21. How to calculate p-Value?
Ans: Calculating the p-value using a spreadsheet's Data Analysis tool (for example, in Excel):
- Go to the Data tab
- Click on Data Analysis
- Select Descriptive Statistics
- Choose the column
- Select summary statistics and confidence level (0.95)
By Manual Method:
- Find H0 and H1
- Find n, x(bar) and s
- Find DF for t-distribution
- Find the type of distribution – t or z distribution
- Find t or z value (using the look-up table)
- Compute the p-value and compare it to the significance level (or, equivalently, compare the t or z value to the critical value)
22. What is ANOVA?
Ans: ANOVA, short for analysis of variance, is a statistical technique used to determine the difference in the means of two or more populations by examining the amount of variation within the samples relative to the amount of variation between the samples. It bifurcates the total amount of variation in the dataset into two parts, i.e. the amount ascribed to chance and the amount ascribed to specific causes.
It is a method of analyzing the factors which are hypothesized to affect the dependent variable. It can also be used to study the variations amongst different categories, within the factors, that consist of numerous possible values. It is of two types:
One way ANOVA: When one factor is used to investigate the difference between different categories, having many possible values.
Two way ANOVA: When two factors are investigated simultaneously to measure the interaction of the two factors influencing the values of a variable.
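For one-way ANOVA, the comparison of between-sample and within-sample variation is summarized by the F statistic. The sketch below computes it from scratch for three small hypothetical groups; turning F into a p-value needs the F-distribution, which is left out here.

```python
from statistics import mean

def one_way_anova_f(groups):
    """F = between-group mean square / within-group mean square."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Variation of the group means around the grand mean (specific causes).
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the values around their own group mean (chance).
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical measurements for three categories of one factor.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat = one_way_anova_f(groups)
print(f"F = {f_stat:.2f}")  # prints: F = 13.00
```

A large F means the group means differ by more than the scatter inside the groups would explain.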
23. What is ANCOVA?
Ans: ANCOVA, which stands for Analysis of Covariance, is an extended form of ANOVA that eliminates the effect of one or more interval-scaled extraneous variables from the dependent variable before carrying out research. It is the midpoint between ANOVA and regression analysis, wherein one variable in two or more populations can be compared while considering the variability of other variables.
When a set of independent variables consists of both a factor (categorical independent variable) and a covariate (metric independent variable), the technique used is known as ANCOVA. The differences in the dependent variable because of the covariate are taken off by an adjustment of the dependent variable’s mean value within each treatment condition.
This technique is appropriate when the metric independent variable is linearly associated with the dependent variable and not to the other factors. It is based on certain assumptions which are:
- There is some relationship between the dependent and uncontrolled variables.
- The relationship is linear and is identical from one group to another.
- Various treatment groups are picked up at random from the population.
- Groups are homogeneous in variability.
24. What is the difference between ANOVA and ANCOVA?
Ans: The points given below are substantial so far as the difference between ANOVA and ANCOVA is concerned:
- The technique of identifying the variance among the means of multiple groups for homogeneity is known as Analysis of Variance or ANOVA. A statistical process which is used to take off the impact of one or more metric-scaled undesirable variable from the dependent variable before undertaking research is known as ANCOVA.
- ANOVA uses both linear and non-linear models, whereas ANCOVA uses only a linear model.
- ANOVA entails only categorical independent variables, i.e. factor. As against this, ANCOVA encompasses a categorical and a metric independent variable.
- A covariate is not taken into account, in ANOVA, but considered in ANCOVA.
- ANOVA attributes between-group variation exclusively to treatment. In contrast, ANCOVA divides between-group variation into treatment and covariate.
- ANOVA attributes within-group variation to individual differences. ANCOVA, by contrast, splits within-group variation into individual differences and covariate.
25. What are t and z scores? Give Details.
T-Score vs. Z-Score: Overview: A z-score and a t score are both used in hypothesis testing.
T-score vs. z-score: When to use a t score:
The general rule of thumb for when to use a t score is when your sample:
- Has a sample size below 30,
- Has an unknown population standard deviation.
You must know the standard deviation of the population and your sample size should be above 30 in order for you to be able to use the z-score. Otherwise, use the t-score.
Technically, z-scores are a conversion of individual scores into a standard form. The conversion allows you to more easily compare different data. A z-score tells you how many standard deviations from the mean your result is. You can use your knowledge of normal distributions (like the 68-95-99.7 rule) or the z-table to determine what percentage of the population will fall below or above your result.
The z-score is calculated using the formula:
- z = (X-μ)/σ
- σ is the population standard deviation and
- μ is the population mean.
- The z-score formula doesn’t say anything about sample size; the rule of thumb is that your sample size should be above 30 to use it.
Like z-scores, t-scores are also a conversion of individual scores into a standard form. However, t-scores are used when you don’t know the population standard deviation; you make an estimate by using your sample.
- T = (X – μ) / [ s/√(n) ]
- s is the standard deviation of the sample, n is the sample size, and X here is the sample mean.
If you have a larger sample (over 30), the t-distribution and z-distribution look pretty much the same.
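Both formulas translate directly into Python. In this sketch the scores and the population values are hypothetical, and the t-score function takes X to be the sample mean, which is the reading that matches the s/√(n) term.

```python
from statistics import mean, stdev

def z_score(x, mu, sigma):
    """z = (X - mu) / sigma, with a known population standard deviation."""
    return (x - mu) / sigma

def t_score(sample, mu):
    """t = (sample mean - mu) / (s / sqrt(n)), with s estimated from the sample."""
    n = len(sample)
    return (mean(sample) - mu) / (stdev(sample) / n ** 0.5)

# Hypothetical exam: population mean 70, population sd 10, one score of 85.
print(f"z = {z_score(85, 70, 10):.2f}")  # prints: z = 1.50

# Hypothetical small sample tested against a population mean of 6.
print(f"t = {t_score([5, 6, 7, 8, 9], 6):.2f}")
```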