1. How is logistic regression done?
Logistic regression models the relationship between the dependent variable (the label we want to predict) and one or more independent variables (the features). It passes a weighted sum of the features through the logistic (sigmoid) function to estimate the probability that an example belongs to a class.
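As a rough illustration, here is a minimal sketch of that idea in plain Python. The function names (sigmoid, predict_proba, fit), the learning rate, and the toy data are all assumptions made for this example; in practice you would normally rely on a library implementation.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Estimated probability of the positive class for one example."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def fit(X, y, lr=0.1, epochs=2000):
    """Fit weights with batch gradient descent on the log loss (illustrative only)."""
    n_features = len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, label in zip(X, y):
            error = predict_proba(features, weights, bias) - label
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias

# Made-up toy data: hours studied -> failed (0) or passed (1)
X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
y = [0, 0, 0, 1, 1, 1]
weights, bias = fit(X, y)
print(predict_proba([1.0], weights, bias), predict_proba([6.0], weights, bias))
```

The model outputs a probability; turning it into a class label means choosing a cutoff, commonly 0.5.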
2. Explain the steps in making a decision tree
- Take the entire dataset as input
- Calculate the entropy of the target variable, as well as of the predictor attributes
- Calculate the information gain of each attribute (information gain measures how much splitting on an attribute reduces entropy, that is, how well it sorts the examples into their classes)
- Select the attribute with the highest information gain as the root node
- Repeat the same process on every branch until the decision node of each branch is finalized
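The entropy and information gain steps above can be sketched in a few lines of Python. The helper names and the tiny weather-style dataset are hypothetical, chosen only to show how the root attribute would be picked.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy of the labels minus the weighted entropy after splitting on attribute."""
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Made-up data: does the match get played?
rows = [
    {"outlook": "sunny", "windy": False},
    {"outlook": "sunny", "windy": True},
    {"outlook": "rainy", "windy": False},
    {"outlook": "rainy", "windy": True},
]
labels = ["no", "no", "yes", "yes"]

gains = {a: information_gain(rows, labels, a) for a in ("outlook", "windy")}
root = max(gains, key=gains.get)  # attribute with the highest information gain
print(gains, root)
```

In this toy data, "outlook" separates the labels perfectly while "windy" tells us nothing, so "outlook" becomes the root node.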
3. How do you build a random forest model?
A random forest is built from a number of decision trees. Each tree is trained on a different random sample of the data, and the random forest combines the predictions of all those trees.
Here are the steps for building a random forest model:
- Randomly select 'k' features from a total of 'm' features, where k << m
- Among the 'k' features, calculate the node 'd' using the best split point
- Split the node into daughter nodes using the best split
- Repeat steps two and three until leaf nodes are finalized
- Build a forest by repeating steps one to four 'n' times to create 'n' trees
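Below is a compact sketch of those steps in plain Python. Several choices here are assumptions rather than part of the steps above: Gini impurity as the split criterion, bootstrap sampling of rows for each tree, drawing the 'k' features at every node, a max_depth cap, and majority voting for prediction. All function names are hypothetical, and class labels must not be tuples because tuples represent internal nodes.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def best_split(X, y, feature_ids):
    """Return the (feature, threshold) pair that most reduces impurity, or None."""
    best, best_score = None, gini(y)
    for f in feature_ids:
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(X, y, k, rng, depth=0, max_depth=5):
    feature_ids = rng.sample(range(len(X[0])), k)      # step 1: k random features
    split = best_split(X, y, feature_ids) if depth < max_depth else None  # step 2
    if split is None:
        return Counter(y).most_common(1)[0][0]         # leaf: majority label
    f, t = split
    left = [i for i, row in enumerate(X) if row[f] <= t]   # step 3: daughter nodes
    right = [i for i, row in enumerate(X) if row[f] > t]
    children = [
        build_tree([X[i] for i in side], [y[i] for i in side], k, rng, depth + 1, max_depth)
        for side in (left, right)
    ]                                                  # step 4: repeat on each branch
    return (f, t, children[0], children[1])

def predict_tree(tree, row):
    while isinstance(tree, tuple):                     # internal nodes are tuples
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

def build_forest(X, y, n_trees, k, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):                           # step 5: n trees
        idx = [rng.randrange(len(X)) for _ in X]       # bootstrap sample (assumption)
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx], k, rng))
    return forest

def predict_forest(forest, row):
    """Majority vote over all trees."""
    return Counter(predict_tree(tree, row) for tree in forest).most_common(1)[0][0]

# Made-up data: the first two features are informative, the third is noise
X = [[1, 1, 3], [2, 1, 5], [1, 2, 3], [2, 2, 5], [8, 9, 3], [9, 8, 5], [8, 8, 3], [9, 9, 5]]
y = ["a", "a", "a", "a", "b", "b", "b", "b"]
forest = build_forest(X, y, n_trees=25, k=2)
print(predict_forest(forest, [1, 1, 4]), predict_forest(forest, [9, 9, 4]))
```

Individual trees can be wrong when their random sample or feature subset is unlucky; the vote across many trees is what makes the forest more stable than a single tree.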
4. How can you avoid overfitting your model?
Overfitting occurs when a model fits its training data too closely, including the noise in it, and therefore fails to generalize to new data.
There are three main methods to avoid overfitting:
- Keep the model simple: take fewer variables into account, thereby removing some of the noise in the training data
- Use cross-validation techniques, such as k-fold cross-validation
- Use regularization techniques, such as LASSO, that penalize large model coefficients, which tend to cause overfitting
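To make k-fold cross-validation concrete, here is a minimal sketch of how the data could be divided into folds. The function name k_fold_indices is hypothetical, and the sketch assumes the rows have already been shuffled.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(10, 3))
for train, validation in folds:
    print("train:", train, "validate:", validation)
```

Each row is used for validation exactly once. Training a model on every train split and averaging the validation scores gives a more honest estimate of how the model performs on unseen data.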
5. Differentiate between univariate, bivariate, and multivariate analysis.
Univariate
Univariate data contains only one variable. The aim of univariate analysis is to describe the data and find patterns that exist within it.
Bivariate
Bivariate data contains two different variables. The analysis of this type of data deals with causes and relationships, and its aim is to determine the relationship between the two variables.
Multivariate
When the data involves three or more variables, it is categorized as multivariate. The analysis is similar to bivariate analysis, but it examines the relationships among more than two variables.
6. You are given a data set consisting of variables with more than 30 percent missing values. How will you deal with them?
To handle missing values, we can choose from the following approaches:
- If the data set is large, we can simply remove the rows with missing values. This is the fastest approach, and we then build the model on the remaining data.
- For smaller data sets, we can substitute each missing value with the mean of the observed values in the same column, using a pandas DataFrame in Python, for example df.fillna(df.mean()).
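Both approaches can be sketched in plain Python so the logic is visible. The helper names and the sample data are made up for this example, and the sketch assumes missing values are stored as None and every column has at least one observed value. With pandas, the rough equivalents would be df.dropna() and df.fillna(df.mean(numeric_only=True)).

```python
def drop_incomplete_rows(rows):
    """Keep only the rows that have no missing (None) values."""
    return [row for row in rows if all(v is not None for v in row)]

def impute_column_means(rows):
    """Replace each None with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]

# Made-up data with two missing entries
data = [[1.0, 10.0], [2.0, None], [None, 30.0], [4.0, 40.0]]

complete = drop_incomplete_rows(data)   # large data sets: drop the rows
filled = impute_column_means(data)      # smaller data sets: fill with the mean
print(complete)
print(filled)
```

Dropping rows here throws away half of the data, which is why imputation is usually preferred when the data set is small.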
7. What is dimensionality reduction, and what are its benefits?
Dimensionality reduction refers to the process of converting a data set with a large number of dimensions (fields) into one with fewer dimensions that conveys similar information concisely. This reduction helps compress the data and reduce storage space. It also cuts computation time, since fewer dimensions mean less computing. Finally, it removes redundant features; for example, there is no point in storing the same measurement in two different units.
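One simple way to remove redundant features is to drop any column that is almost perfectly correlated with a column already kept. The sketch below shows this using only the standard library. It is just one of many possible techniques (PCA is a common alternative), and the function names, the 0.95 threshold, and the data are assumptions for illustration. It also assumes no column is constant, since that would make the correlation undefined.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def drop_redundant(columns, threshold=0.95):
    """Keep a column only if it is not strongly correlated with one already kept."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return kept

# Made-up data: height is stored twice, in centimeters and in inches
columns = {
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "height_in": [59.1, 63.0, 66.9, 70.9],
    "weight_kg": [70.0, 55.0, 80.0, 62.0],
}
reduced = drop_redundant(columns)
print(list(reduced))
```

The inches column carries no information beyond the centimeters column, so removing it shrinks the data without losing anything useful.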
8. How should you maintain a deployed model?
The steps to maintain a deployed model are:
Monitor: Constant monitoring of all models is needed to track their performance. When you change something, you want to know how the change affects the model, so keep monitoring it to confirm that it still does what it is supposed to do.
Evaluate: Evaluation metrics of the current model are calculated to determine whether a new algorithm is needed.
Compare: The candidate models are compared to each other to determine which one performs best.
Rebuild: The best-performing model is rebuilt on the current state of data.
9. What are recommender systems?
A recommender system predicts how a user would rate a specific product based on their preferences.
It can be split into two different areas:
Collaborative Filtering:
As an example, Last.fm recommends tracks that other users with similar interests play often. This is also commonly seen on Amazon after making a purchase; customers may notice the following message accompanied by product recommendations: "Users who bought this also bought..."
Content-based Filtering:
For example, Pandora uses the properties of a song to recommend music with similar properties. Here, we look at content, instead of looking at who else is listening to the music.
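To show what collaborative filtering looks like in code, here is a minimal user-based sketch: it scores items the target user has not rated by how similar other users are to them. The user names, items, ratings, and function names are all invented for the example, and real systems use far more refined methods.

```python
import math

def cosine(u, v):
    """Cosine similarity between two users' rating dictionaries."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[item] * v[item] for item in shared)
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v)

def recommend(target, ratings, top_n=1):
    """Rank items the target user has not rated, weighted by user similarity."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == target:
            continue
        similarity = cosine(ratings[target], other_ratings)
        for item, rating in other_ratings.items():
            if item not in ratings[target]:
                scores[item] = scores.get(item, 0.0) + similarity * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Made-up ratings on a 1 to 5 scale
ratings = {
    "ana": {"rock_album": 5, "jazz_album": 1},
    "ben": {"rock_album": 5, "jazz_album": 1, "metal_album": 5},
    "cy": {"rock_album": 1, "jazz_album": 5, "blues_album": 5},
}
print(recommend("ana", ratings))
```

Ana's tastes match Ben's much more closely than Cy's, so the album Ben liked ranks first. A content-based system would compare the properties of the albums themselves instead of comparing users.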
10. How can you select k for k-means?
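We use the elbow method to choose k for k-means clustering. The idea of the elbow method is to run k-means clustering on the data set for a range of values of 'k', where 'k' is the number of clusters, and to calculate the within-cluster sum of squares (WSS) for each value.
WSS is defined as the sum of the squared distances between each member of a cluster and its centroid. WSS always shrinks as k grows, so we plot WSS against k and pick the value of k at the "elbow" of the curve, the point after which adding more clusters yields only a small further reduction.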
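Here is a small sketch of the WSS calculation for several values of k. To keep it deterministic it uses a simple farthest-point initialization, which is an assumption of this example rather than part of the elbow method; the function names and the sample points are also made up.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, k, iterations=20):
    # Farthest-point initialization keeps this sketch deterministic (assumption)
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    for _ in range(iterations):
        # Assign every point to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty)
        centroids = [mean_point(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters

def wss(centroids, clusters):
    """Within-cluster sum of squares: squared distance of each point to its centroid."""
    return sum(
        squared_distance(p, c) for c, cluster in zip(centroids, clusters) for p in cluster
    )

# Made-up 2-D points forming three visible groups
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (1, 8), (1, 9), (2, 9)]

wss_by_k = {k: wss(*k_means(points, k)) for k in range(1, 5)}
for k, value in wss_by_k.items():
    print(k, round(value, 2))
```

For these points WSS drops sharply up to k = 3 and only slightly after that, so the elbow suggests three clusters.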