# Large Scale Machine Learning

1. Suppose you are training a logistic regression classifier using stochastic gradient descent. You find that the cost (say, $cost(\theta,(x^{(i)},y^{(i)}))$, averaged over the last 500 examples), plotted as a function of the number of iterations, is slowly increasing over time. Which of the following changes are likely to help?

• Try using a smaller learning rate α.

• Try averaging the cost over a larger number of examples (say 1000 examples instead of 500) in the plot.

• This is not an issue, as we expect this to occur with stochastic gradient descent.

• Try using a larger learning rate α.

• Use fewer examples from your training set.

• Try halving (decreasing) the learning rate α, and see if that causes the cost to now consistently go down; and if not, keep halving it until it does.

• This is not possible with stochastic gradient descent, as it is guaranteed to converge to the optimal parameters θ.

• Try averaging the cost over a smaller number of examples (say 250 examples instead of 500) in the plot.
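The monitoring setup described in this question can be sketched roughly as follows. This is only an illustrative sketch, not the course's code: the helper names (`sgd_window_costs`, `halve_until_decreasing`, `make_toy_data`) and the synthetic data are assumptions made for the example. It records $cost(\theta,(x^{(i)},y^{(i)}))$ before each update, averages it over blocks of 500 examples, and halves α until the averaged cost goes down.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def example_cost(theta, x, y):
    # cost(theta, (x, y)) for logistic regression; h is clipped to avoid log(0).
    h = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
    h = min(max(h, 1e-12), 1.0 - 1e-12)
    return -(y * math.log(h) + (1 - y) * math.log(1.0 - h))


def sgd_window_costs(data, alpha, window=500, seed=0):
    """One shuffled pass of SGD; returns the mean cost of each block of
    `window` examples. Assumes the data holds at least two full windows."""
    examples = list(data)
    random.Random(seed).shuffle(examples)
    theta = [0.0] * len(examples[0][0])
    averages, running = [], 0.0
    for i, (x, y) in enumerate(examples, start=1):
        running += example_cost(theta, x, y)  # measured before updating on (x, y)
        error = sigmoid(sum(t * xi for t, xi in zip(theta, x))) - y
        theta = [t - alpha * error * xi for t, xi in zip(theta, x)]
        if i % window == 0:
            averages.append(running / window)
            running = 0.0
    return averages


def halve_until_decreasing(data, alpha, window=500, max_halvings=20):
    """Halve alpha until the last window's average cost is below the first's."""
    costs = sgd_window_costs(data, alpha, window)
    for _ in range(max_halvings):
        if costs[-1] < costs[0]:
            break
        alpha /= 2.0
        costs = sgd_window_costs(data, alpha, window)
    return alpha, costs


def make_toy_data(m=5000, seed=1):
    # Hypothetical synthetic data: label is 1 when x1 + x2 > 0; x0 = 1 is the intercept.
    rng = random.Random(seed)
    data = []
    for _ in range(m):
        x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
        data.append(([1.0, x1, x2], 1 if x1 + x2 > 0 else 0))
    return data
```

Calling `halve_until_decreasing(make_toy_data(), alpha=0.1)` returns the learning rate that was finally used together with the list of windowed averages, which is what one would plot against the number of iterations.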

2. Which of the following statements about stochastic gradient descent are true? Check all that apply.

• Suppose you are using stochastic gradient descent to train a linear regression classifier. The cost function $J(\theta) = \frac{1}{2m} \sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)})^2$ is guaranteed to decrease after every iteration of the stochastic gradient descent algorithm.

• One of the advantages of stochastic gradient descent is that it can start making progress in improving the parameters θ after looking at just a single training example; in contrast, batch gradient descent needs to take a pass over the entire training set before it starts to make progress in improving the parameters’ values.

• Stochastic gradient descent is particularly well suited to problems with small training set sizes; in these problems, stochastic gradient descent is often preferred to batch gradient descent.

• In each iteration of stochastic gradient descent, the algorithm needs to examine/use only one training example.

• Before running stochastic gradient descent, you should randomly shuffle (reorder) the training set.

• In order to make sure stochastic gradient descent is converging, we typically compute $J_{train}(\theta)$ after each iteration (and plot it) in order to make sure that the cost function is generally decreasing.

• You can use the method of numerical gradient checking to verify that your stochastic gradient descent implementation is bug-free. (One step of stochastic gradient descent computes the partial derivative $\frac{\partial}{\partial \theta_j} cost(\theta,(x^{(i)},y^{(i)}))$.)

• If you have a huge training set, then stochastic gradient descent may be much faster than batch gradient descent.
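The gradient-checking statement above can be made concrete with a small sketch. It assumes the per-example squared-error cost $cost(\theta,(x,y)) = \frac{1}{2}(h_\theta(x)-y)^2$ for linear regression, and the function names are made up for this illustration. It compares the analytic partial derivatives used by one SGD step against a central-difference estimate.

```python
def squared_error_cost(theta, x, y):
    # Per-example cost for linear regression: 0.5 * (h_theta(x) - y)^2.
    h = sum(t * xi for t, xi in zip(theta, x))
    return 0.5 * (h - y) ** 2


def analytic_gradient(theta, x, y):
    # d/d theta_j of the per-example cost is (h_theta(x) - y) * x_j.
    h = sum(t * xi for t, xi in zip(theta, x))
    return [(h - y) * xi for xi in x]


def numerical_gradient(cost_fn, theta, x, y, eps=1e-4):
    # Central difference: (cost(theta + eps*e_j) - cost(theta - eps*e_j)) / (2*eps).
    grad = []
    for j in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[j] += eps
        minus[j] -= eps
        grad.append((cost_fn(plus, x, y) - cost_fn(minus, x, y)) / (2.0 * eps))
    return grad


def gradients_agree(theta, x, y, tol=1e-6):
    # True when every analytic partial derivative matches its numerical estimate.
    a = analytic_gradient(theta, x, y)
    n = numerical_gradient(squared_error_cost, theta, x, y)
    return all(abs(ai - ni) < tol for ai, ni in zip(a, n))
```

A check such as `gradients_agree([0.5, -1.0, 2.0], [1.0, 3.0, -2.0], 4.0)` would normally be run on a handful of examples during debugging and then switched off, since the numerical estimate is far slower than the analytic gradient.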

3. Which of the following statements about online learning are true? Check all that apply.

• One of the disadvantages of online learning is that it requires a large amount of computer memory/disk space to store all the training examples we have seen.

• In the approach to online learning discussed in the lecture video, we repeatedly get a single training example, take one step of stochastic gradient descent using that example, and then move on to the next example.

• One of the advantages of online learning is that there is no need to pick a learning rate α.

• When using online learning, in each step we get a new example (x, y), perform one step of (essentially stochastic gradient descent) learning on that example, and then discard that example and move on to the next.

• When using online learning, you must save every new training example you get, as you will need to reuse past examples to re-train the model even after you get new training examples in the future.

• Online learning algorithms are most appropriate when we have a fixed training set of size m that we want to train on.

• One of the advantages of online learning is that if the function we’re modeling changes over time (such as if we are modeling the probability of users clicking on different URLs, and user tastes/preferences are changing over time), the online learning algorithm will automatically adapt to these changes.

• Online learning algorithms are usually best suited to problems where we have a continuous/non-stop stream of data that we want to learn from.
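To make the online-learning loop in these statements concrete, here is a minimal sketch. The class `OnlineLogisticRegression` and the generator `click_stream` are hypothetical names invented for this example, and the "user preference" data is synthetic. Each example is used for a single stochastic-gradient-style update and then dropped, and the stream's behaviour flips halfway through to mimic changing user tastes.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class OnlineLogisticRegression:
    """Keeps only the parameters theta; no training example is stored."""

    def __init__(self, n_features, alpha=0.5):
        self.theta = [0.0] * n_features
        self.alpha = alpha

    def predict_proba(self, x):
        return sigmoid(sum(t * xi for t, xi in zip(self.theta, x)))

    def learn_one(self, x, y):
        # One gradient step on the single example (x, y), which is then discarded.
        error = self.predict_proba(x) - y
        self.theta = [t - self.alpha * error * xi for t, xi in zip(self.theta, x)]


def click_stream(n, preference_sign, rng):
    # Hypothetical stream: a user clicks (y = 1) when preference_sign * x1 > 0.
    for _ in range(n):
        x1 = rng.uniform(-1.0, 1.0)
        yield [1.0, x1], (1 if preference_sign * x1 > 0 else 0)


def run_demo(seed=0):
    rng = random.Random(seed)
    model = OnlineLogisticRegression(n_features=2)
    for x, y in click_stream(2000, +1, rng):
        model.learn_one(x, y)
    weight_before = model.theta[1]
    for x, y in click_stream(2000, -1, rng):  # user preferences flip
        model.learn_one(x, y)
    return weight_before, model.theta[1]
```

In `run_demo`, the weight on `x1` is positive after the first phase and becomes negative after the second, without ever re-training on stored examples. A learning rate α still has to be chosen, as the constructor makes explicit.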

4. Assuming that you have a very large training set, which of the following algorithms do you think can be parallelized using map-reduce and splitting the training set across different machines? Check all that apply.

• A neural network trained using batch gradient descent.

• Linear regression trained using batch gradient descent.

• An online learning setting, where you repeatedly get a single example (x, y), and want to learn from that single example before moving on.

• Logistic regression trained using stochastic gradient descent.

• Computing the average of all the features in your training set $\mu = \frac{1}{m} \sum_{i=1}^m x^{(i)}$ (say in order to perform mean normalization).

• Logistic regression trained using batch gradient descent.

• Linear regression trained using stochastic gradient descent.
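As a small illustration of a computation that is a sum over the training set, the feature mean $\mu = \frac{1}{m} \sum_{i=1}^m x^{(i)}$ can be sketched in map-reduce style as below. The "machines" are simulated with plain Python lists, and every function name is an assumption made for the example. Each machine returns a partial sum and a count, and the reduce step adds them up.

```python
from functools import reduce


def split_across_machines(examples, n_machines):
    # One piece per machine; a piece may be empty if there are few examples.
    return [examples[k::n_machines] for k in range(n_machines)]


def map_partial_sum(chunk, n_features):
    # Work done on one machine: per-feature sum and example count for its piece.
    sums = [0.0] * n_features
    for x in chunk:
        for j, value in enumerate(x):
            sums[j] += value
    return sums, len(chunk)


def combine(a, b):
    # Reduce step: add two (sums, count) pairs together.
    return [s + t for s, t in zip(a[0], b[0])], a[1] + b[1]


def mapreduce_mean(examples, n_machines):
    n_features = len(examples[0])
    chunks = split_across_machines(examples, n_machines)
    partials = [map_partial_sum(chunk, n_features) for chunk in chunks]  # map
    sums, count = reduce(combine, partials)                              # reduce
    return [s / count for s in sums]
```

For instance, `mapreduce_mean([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], n_machines=2)` gives the same per-feature mean as a single pass over the whole list. On a real cluster, the list comprehension in the map step would be replaced by calls dispatched to separate workers.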

5. Which of the following statements about map-reduce are true? Check all that apply.

• When using map-reduce with gradient descent, we usually use a single machine that accumulates the gradients from each of the map-reduce machines, in order to compute the parameter update for that iteration.

• Because of network latency and other overhead associated with map-reduce, if we run map-reduce using N computers, we might get less than an N-fold speedup compared to using 1 computer.

• If you have only 1 computer with 1 computing core, then map-reduce is unlikely to help.

• If we run map-reduce using N computers, then we will always get at least an N-fold speedup compared to using 1 computer.

• Running map-reduce over N computers requires that we split the training set into $N^2$ pieces.

• In order to parallelize a learning algorithm using map-reduce, the first step is to figure out how to express the main work done by the algorithm as computing sums of functions of training examples.
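The idea of expressing the work as sums over training examples can be sketched for batch gradient descent on linear regression. This is a simulation on one machine, with hypothetical function names and a tiny synthetic data set; real map-reduce frameworks add scheduling and network overhead that this sketch ignores. The training set is split into one piece per machine, each piece yields a partial gradient sum, and a central step adds those sums and updates θ.

```python
def split_across_machines(examples, n_machines):
    # One piece of the training set per machine.
    return [examples[k::n_machines] for k in range(n_machines)]


def partial_gradient_sum(theta, chunk):
    # Map step on one machine: sum of (h_theta(x) - y) * x_j over its piece.
    grad = [0.0] * len(theta)
    for x, y in chunk:
        error = sum(t * xi for t, xi in zip(theta, x)) - y
        for j, xi in enumerate(x):
            grad[j] += error * xi
    return grad


def batch_step(theta, chunks, alpha, m):
    # Central machine: add the partial sums (reduce), then update theta once.
    partials = [partial_gradient_sum(theta, chunk) for chunk in chunks]
    total = [sum(column) for column in zip(*partials)]
    return [t - alpha * g / m for t, g in zip(theta, total)]


def train(examples, n_machines, alpha=0.5, iterations=2000):
    chunks = split_across_machines(examples, n_machines)
    theta = [0.0] * len(examples[0][0])
    for _ in range(iterations):
        theta = batch_step(theta, chunks, alpha, len(examples))
    return theta


# Hypothetical data generated from y = 2 + 3 * x, with x0 = 1 as the intercept feature.
toy_examples = [([1.0, i / 10.0], 2.0 + 3.0 * (i / 10.0)) for i in range(10)]
```

`train(toy_examples, n_machines=1)` and `train(toy_examples, n_machines=4)` arrive at essentially the same parameters, because the summed gradient does not depend on how the examples are partitioned.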

Thanks & Regards,
- APDaga DumpBox

2. Question 5: Which of the following statements about map-reduce are true? Check all that apply.

When using map-reduce with gradient descent, we usually use a single machine that accumulates the gradients from each of the map-reduce machines, in order to compute the parameter update for that iteration.

If you have just 1 computer, but your computer has multiple CPUs or multiple cores, then map-reduce might be a viable way to parallelize your learning algorithm.

Running map-reduce over $N$ computers requires that we split the training set into $N^2$ pieces.

In order to parallelize a learning algorithm using map-reduce, the first step is to figure out how to express the main work done by the algorithm as computing sums of functions of training examples.

3. Please add the last option: In order to parallelize a learning algorithm using map-reduce, the first step is to figure out how to express the main work done by the algorithm as computing sums of functions of training examples.

1. Done. Thank you very much for your help.