Writing an LLM from scratch, part 20 – starting training, and cross entropy loss