
The primary goal of this project is to develop a machine-learning-based classification system that identifies the programming language of a code snippet. With the growth of open-source contributions and multi-language projects, automatic language detection is needed for correct syntax highlighting, toolchain selection, and error checking. The model will apply Natural Language Processing (NLP) techniques to the syntax, keywords, and structural patterns of the code and classify each snippet into one of a predefined set of programming languages, such as Python, Java, C++, and JavaScript. By the end of the project, students will deliver a working model that identifies programming languages accurately enough for real-time use in settings such as code editors and learning platforms.
The project follows a structured twelve-week development timeline. In the early stages, students will set up a Python development environment, either locally with Anaconda or in Google Colab, and explore essential libraries such as NLTK, scikit-learn, and spaCy. They will then collect and curate a dataset of code samples across several popular programming languages.
The middle weeks will focus on preprocessing the data (removing comments, normalizing formatting, and similar cleanup), extracting features from the code, and training classifiers such as Naive Bayes and Support Vector Machines (SVM). The models will be evaluated using accuracy and a confusion matrix. Once a baseline model is validated, students will improve its performance through hyperparameter tuning and an expanded dataset. The final weeks are reserved for full model integration, documentation, and a team presentation. Although the project is limited to common programming languages, it gives students hands-on experience with real-world NLP and ML applications.
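
The preprocessing, feature-extraction, training, and evaluation steps described above can be illustrated with a small sketch. It is not the project's actual implementation: it uses only the Python standard library, a hand-written multinomial Naive Bayes with add-one smoothing, and a tiny dataset invented for illustration. The names `strip_comments`, `tokenize`, `train_nb`, `predict`, and `evaluate` are hypothetical, and the comment-stripping heuristic is deliberately crude (it will also damage string literals that contain `//`).

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical toy data, invented for illustration only.
TRAIN = [
    ("def add(a, b):\n    return a + b  # sum", "Python"),
    ("import os\nfor name in os.listdir('.'):\n    print(name)", "Python"),
    ("public class Main { public static void main(String[] args) {} }", "Java"),
    ("private int count = 0; // counter\npublic void inc() { count++; }", "Java"),
    ("#include <iostream>\nint main() { std::cout << 1; return 0; }", "C++"),
    ("std::vector<int> v; /* values */ v.push_back(3);", "C++"),
    ("const add = (a, b) => a + b; console.log(add(1, 2));", "JavaScript"),
    ("function greet(name) { let msg = 'hi'; console.log(msg); }", "JavaScript"),
]
TEST = [
    ("def greet(name):\n    print(name)", "Python"),
    ("public static void run() { System.out.println(2); }", "Java"),
    ("#include <vector>\nint main() { std::vector<int> v; return 0; }", "C++"),
    ("let total = 0; console.log(total);", "JavaScript"),
]


def strip_comments(code):
    """Crude, language-agnostic comment removal (assumption: good enough for a demo)."""
    code = re.sub(r"/\*.*?\*/", " ", code, flags=re.DOTALL)  # /* block */ comments
    code = re.sub(r"//[^\n]*", " ", code)                    # // line comments
    code = re.sub(r"#[ \t][^\n]*", " ", code)                # "# ..." but not "#include"
    return code


def tokenize(code):
    """Split code into identifiers/keywords and single punctuation characters."""
    return re.findall(r"[A-Za-z_]\w*|[^\w\s]", strip_comments(code))


def train_nb(samples):
    """Count tokens per language; these counts are the whole model."""
    token_counts = defaultdict(Counter)
    doc_counts = Counter()
    for code, lang in samples:
        token_counts[lang].update(tokenize(code))
        doc_counts[lang] += 1
    vocab = {tok for counts in token_counts.values() for tok in counts}
    return token_counts, doc_counts, vocab


def predict(model, code):
    """Return the language with the highest log-posterior under Naive Bayes."""
    token_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    tokens = [t for t in tokenize(code) if t in vocab]  # ignore unseen tokens
    best_lang, best_score = None, -math.inf
    for lang in sorted(doc_counts):
        total = sum(token_counts[lang].values())
        score = math.log(doc_counts[lang] / total_docs)  # log prior
        for tok in tokens:
            # Add-one (Laplace) smoothing avoids log(0) for tokens unseen in this class.
            score += math.log((token_counts[lang][tok] + 1) / (total + len(vocab)))
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang


def evaluate(model, samples):
    """Return accuracy and a confusion matrix keyed by (true, predicted)."""
    confusion = Counter()
    for code, lang in samples:
        confusion[(lang, predict(model, code))] += 1
    correct = sum(n for (true, pred), n in confusion.items() if true == pred)
    return correct / len(samples), confusion


model = train_nb(TRAIN)
accuracy, confusion = evaluate(model, TEST)
print(f"accuracy: {accuracy:.2f}")
for (true_lang, predicted), n in sorted(confusion.items()):
    print(f"true={true_lang:<10} predicted={predicted:<10} count={n}")
```

In the project itself, the hand-written pieces would likely be replaced by library components, for example scikit-learn's `CountVectorizer` with `MultinomialNB` or `LinearSVC`, and `accuracy_score` and `confusion_matrix` for evaluation. The from-scratch version is shown only to make each step visible.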