Exploring the Size of AI Software Codebases

How Large Are AI Software Codebases?

Let's start by explaining the heading. Why am I writing this? At first, I assumed that AI systems, ChatGPT in particular, have large codebases, just like the other technologies we work with as full-stack developers.

If you're also curious to know, then let's dive into it…

First of all, let’s discuss what AI is. It’s like a human: it is capable of doing only the things it has been trained to do, just like us. For example, a cloud engineer can’t do full-stack development, and a full-stack developer can’t write smart contracts unless he/she learns that technology. In the same way, a model (let’s treat it as a small subset of AI) can’t do things it was not trained for.

So, how much time does it take for a human being to come into existence and start doing things? Approximately 20 years (I am just using graduation as the benchmark). And how long does it take to build a human being? Only nine months. So a human spends almost 19 years just learning things and forming their own rules, and based on those rules they make decisions in the future. Similarly, only about 20% of the time spent on an AI system goes into developing or creating it. The remaining 80% goes into training and testing on data, and before that, cleaning the data: removing bias, correcting errors, and eliminating mistakes.

A single person could develop something like ChatGPT. All one needs is a very powerful (and, of course, expensive) processor (so that the model can train well) and good knowledge of all the algorithms. Yes, you read that right. The core team of ChatGPT during the development phase was:

  • Researchers & Scientists – AI/ML experts who designed the transformer models (~10-20 people).

  • Engineers – Software, ML, and DevOps engineers who implemented and optimized the model (~10-30 people).

  • Data Labelers & Trainers – People involved in collecting and fine-tuning data (hundreds, but often outsourced).

So, you can see it was only about 10 to 20 people at the core, while a much larger number of people were involved in collecting and fine-tuning the data. The engineers are responsible for making the platform easy to use and widely available.

So, AI is not developed solely by computer scientists or engineers. A big contribution comes from mathematicians who design the algorithms; engineers then turn these algorithms into code that machines can run. Basically, an algorithm is a rule, based on which decisions are made. There are various algorithms like ViT, linear regression, CNNs, and so on. These models/algorithms are developed by researchers and engineers at big companies and in open-source communities, and the work behind them is largely mathematical. We use these models to perform tasks, like disease detection using ViT or a CNN. You can use any freely available model without crediting anyone, much as you can write C++ without a license, though some models do require licenses (e.g., GPT-4, Claude).
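To make "an algorithm is a rule" concrete, here is a minimal sketch of one of the algorithms mentioned above, linear regression, trained by gradient descent in plain Python. The data and hyperparameters are made-up illustration values, not from any real system:

```python
# Linear regression: learn the rule y = w*x + b from examples,
# by repeatedly nudging w and b to reduce the average squared error.

def train_linear_regression(xs, ys, lr=0.01, epochs=2000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy data following y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
w, b = train_linear_regression(xs, ys)
print(round(w, 2), round(b, 2))  # close to 2.0 and 1.0
```

Notice how this mirrors the 20/80 split: the "development" (the dozen lines defining the rule) is trivial, while almost all the compute happens inside the training loop.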

Since every model has its own pros and cons, every model’s accuracy will be different. After evaluating the candidate models, we will know which one is best for our task. For example, CNNs were once the best fit for images and object recognition, but transformers, which were originally used for NLP, are now doing well in this domain too. Detecting diseases with an existing model can take as little as 100 lines of code. And with roughly that much code, you can also write the core of a GPT-style model.

So, training takes time, not the development.

  1. GPT-4 is an AI model (a large language model based on transformers), but OpenAI provides it as SaaS. It likely uses custom architectures beyond just ViT or CNN.

  2. Building a ChatGPT-like model is possible for individuals with expertise and compute power, but training it at OpenAI’s scale requires massive data, GPUs, and fine-tuning.

  3. AI development involves more than just training: it also means optimizing architectures, improving inference efficiency, and applying reinforcement learning from human feedback (RLHF).

  4. GPT is based on transformers, but heavily modified and optimized, similar to how apps like WhatsApp and Zoom build customized layers on top of standard networking protocols such as WebSocket.

  5. AI models don’t have massive codebases like MERN apps; most of the effort goes into data collection, cleaning, and training rather than just writing code.

  6. Training takes much more time than model development, just like how a human takes years to learn.

  7. An AI engineer should know algorithms, their differences, and how data (text, images, audio) is stored and processed in computers—this helps in choosing the right model for a task.
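Point 7 above is easy to demonstrate: computers store every kind of data as numbers. The snippet below is a small illustration in plain Python, with made-up values; text becomes integer codes and bytes, and a grayscale "image" is just a grid of brightness numbers:

```python
# Text: each character is stored as an integer code point,
# and serialized to bytes (UTF-8) on disk or over the network.
text = "AI"
codes = [ord(ch) for ch in text]
print(codes)                    # [65, 73]
print(text.encode("utf-8"))    # b'AI'

# Image: a 2x3 grayscale picture is a grid of brightness values,
# where 0 is black and 255 is white.
image = [
    [0, 128, 255],
    [255, 128, 0],
]
print(image[0][1])             # 128, one pixel's brightness
```

Knowing these representations helps you pick the right model: sequences of token codes suit transformers, while grids of pixel values suit CNNs or ViTs.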

Conclusion

AI model development is fast, but training and data preparation take most of the time. Writing the algorithmic code behind GPT or a model like DeepSeek does not take much time, but training and testing it does. AI models do have significant code, but nothing like a full-stack MERN app: the core model might be 1,000+ lines, while the real challenge is optimization, data, and training.

Thank you for reading my article! Please check out my other articles as well.