In your opinion, how many A100 40GB GPUs would I need to train a 900M-parameter LLM based on the GPT architecture (is one enough, or do I need more than Colab offers)? And roughly how long would training on a 20GB dataset take on Google Colab with the "Pay As You Go" plan? My tokenizer has a vocabulary of 35,000 tokens.
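
For context, here is the rough back-of-envelope math I have been working from; it is only a sketch, assuming mixed-precision Adam training (roughly 16 bytes of model/optimizer state per parameter, before activations), about 4 bytes of raw text per token with my 35k-vocabulary tokenizer, and the common 6·N·D FLOPs approximation at an assumed ~100 TFLOP/s sustained throughput on a single A100. Please correct any of these assumptions if they are off.

```python
# Back-of-envelope estimates; every number below is an assumption, not a measurement.

# --- GPU memory for mixed-precision Adam training ---
params = 900e6                     # 900M parameters (from the question)
bytes_per_param = 2 + 4 + 8 + 2    # fp16 weights + fp32 master weights + Adam m/v + fp16 grads
state_gb = params * bytes_per_param / 1e9
print(f"Model + optimizer state: ~{state_gb:.1f} GB")   # ~14.4 GB, before activations

# --- Training time via the ~6*N*D FLOPs rule of thumb ---
dataset_bytes = 20e9               # 20 GB of raw text (from the question)
bytes_per_token = 4                # assumed average; depends on the tokenizer
tokens = dataset_bytes / bytes_per_token             # ~5B tokens for one epoch
flops = 6 * params * tokens                          # ~2.7e19 FLOPs
effective_flops_per_sec = 100e12                     # assumed ~100 TFLOP/s sustained on one A100
hours = flops / effective_flops_per_sec / 3600
print(f"~{tokens/1e9:.1f}B tokens, ~{hours:.0f} hours for one epoch on a single A100")
```

By that math the model and optimizer state alone (~14 GB) would fit in 40 GB, but I don't know how much headroom activations need at a reasonable batch size, or whether ~75 GPU-hours per epoch is realistic on Colab.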