In most cases, it is not possible to feed all the training data into an algorithm in one pass. This is due to the size of the dataset and the memory limitations of the compute instance used for training. A few terms help explain how the data is broken into smaller pieces.
An epoch elapses when an entire dataset is passed forward and backward through the neural network exactly one time. If the entire dataset cannot be passed into the algorithm at once, it must be divided into mini-batches. Batch size is the total number of training samples present in a single mini-batch. An iteration is a single gradient update (update of the model's weights) during training. The number of iterations per epoch is therefore equal to the number of mini-batches needed to cover the dataset once.
So if a dataset includes 1,000 images split into mini-batches of 100 images, it will take 10 iterations to complete a single epoch.
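The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a training loop: the helper names `iterations_per_epoch` and `mini_batches` are hypothetical, and a list of integers stands in for the 1,000 images. Rounding up with `math.ceil` is an assumption that the final, smaller mini-batch is still used when the dataset size is not a multiple of the batch size.

```python
import math


def iterations_per_epoch(num_samples: int, batch_size: int) -> int:
    """Return how many mini-batches (weight updates) make up one epoch."""
    if num_samples <= 0 or batch_size <= 0:
        raise ValueError("num_samples and batch_size must be positive")
    # Round up so a final, partially filled mini-batch still counts.
    return math.ceil(num_samples / batch_size)


def mini_batches(samples, batch_size):
    """Yield consecutive slices holding at most batch_size samples each."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


dataset = list(range(1000))  # stand-in for 1,000 images
batches = list(mini_batches(dataset, 100))

print(iterations_per_epoch(len(dataset), 100))  # 10
print(len(batches))                             # 10
```

With a batch size that does not divide the dataset evenly, for example 64, the same function returns 16: fifteen full mini-batches plus one of 40 samples.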
With each pass through the network the weights are updated, and as epochs accumulate the model's fit typically moves from underfitting, to optimal, to overfitting. There is no magic rule for choosing the number of epochs; it is a hyperparameter that must be set before training begins.
Like the number of epochs, batch size is a hyperparameter with no magic rule of thumb. Choosing a batch size that is too small introduces a high degree of variance (noisiness) from one batch to the next, because a small sample is unlikely to be a good representation of the entire dataset. Conversely, if a batch size is too large, it may not fit in the memory of the compute instance used for training, and it may tend to overfit the data. Batch size also interacts with other hyperparameters such as learning rate, so the combination of these hyperparameters matters as much as the batch size itself.
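The noisiness of small batches can be seen in a small simulation. The sketch below is only an analogy under stated assumptions: random numbers drawn from a normal distribution stand in for per-sample values (such as gradients), the batch mean stands in for the quantity a mini-batch estimates, and the function name `batch_mean_spread` is hypothetical. It measures how much that batch mean fluctuates across many randomly drawn batches.

```python
import random
import statistics


def batch_mean_spread(population, batch_size, trials=500, seed=0):
    """Standard deviation of the batch mean over many random mini-batches."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.sample(population, batch_size))
        for _ in range(trials)
    ]
    return statistics.stdev(means)


# Synthetic stand-in for per-sample values; not real training data.
data_rng = random.Random(42)
population = [data_rng.gauss(0.0, 1.0) for _ in range(10_000)]

small = batch_mean_spread(population, batch_size=8)
large = batch_mean_spread(population, batch_size=512)

print(f"spread of batch mean, batch size 8:   {small:.3f}")
print(f"spread of batch mean, batch size 512: {large:.3f}")
```

The spread for the small batch should come out several times larger than for the large one, which is the variance described above. The simulation says nothing about memory limits or overfitting; those depend on the model and the hardware.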
A common heuristic for batch size is to use the square root of the size of the dataset. However, this is a hotly debated topic, so treat it as a starting point rather than a rule.
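As a rough sketch of that heuristic, the snippet below computes a square-root batch size and the resulting iterations per epoch for a few dataset sizes. The function name `sqrt_batch_size` is hypothetical, and rounding to the nearest integer is an assumption; the heuristic itself does not specify how to round.

```python
import math


def sqrt_batch_size(num_samples: int) -> int:
    """Square-root heuristic: batch size close to sqrt(dataset size)."""
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    # Rounding to the nearest integer is an assumption of this sketch.
    return max(1, round(math.sqrt(num_samples)))


for n in (1_000, 10_000, 1_000_000):
    batch_size = sqrt_batch_size(n)
    iterations = math.ceil(n / batch_size)
    print(f"{n:>9} samples -> batch size {batch_size}, "
          f"{iterations} iterations per epoch")
```

For the 1,000-image example this gives a batch size of 32 and 32 iterations per epoch. In practice the value would still need to be checked against available memory and tuned together with the learning rate.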