AlgoDetect is a conceptual project that demonstrates a method for identifying fundamental algorithmic constructs in unanalyzed source code. The approach combines the structural information captured in a program's Abstract Syntax Tree (AST) with the code understanding of CodeBERTa, a language model pre-trained on source code. The goal is robust classification of code patterns that remains effective even with small, high-quality datasets.
Applying machine learning to source code poses unique challenges. Unlike natural language, code has a complex, hierarchical syntax that goes far beyond a flat sequence of words. Conventional sequence models such as Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs), even though LSTMs mitigate vanishing gradients, largely treat code as flat token streams. This prevents them from capturing long-range or deeply nested dependencies, such as matching parentheses or nested loops. Transformers (such as BERT or GPT) handle longer sequences and parallelize well through self-attention, but they too traditionally operate on linear token flows and frequently miss critical structural information.
To overcome these limitations, AlgoDetect uses CodeBERTa, a RoBERTa-like Transformer pre-trained on large code corpora. CodeBERTa's tokenizer, a byte-level BPE vocabulary trained on GitHub code, tokenizes source code efficiently, typically producing sequences 33-50% shorter than general-purpose NLP tokenizers. The compact CodeBERTa model (6 layers, roughly 84 million parameters), trained from scratch on two million functions, brings a strong grasp of code semantics. By fusing AST-driven preprocessing with CodeBERTa's rich embeddings, our methodology captures both the syntactic and semantic signals needed for accurate code pattern recognition.
Most sequence models used in NLP (RNNs, LSTMs, and generic Transformers) are architected for linear, human language. Code, in contrast, is fundamentally hierarchical and syntactic. This design mismatch leads to several issues with traditional models:

- They consume code as a flat token stream rather than the tree structure it actually has.
- Long-range and deeply nested dependencies, such as matching parentheses or nested loops, are easily lost.
- Structural information that an AST makes explicit is never shown to the model.
CodeBERTa stands out as a RoBERTa-style Transformer pre-trained specifically on programming code, offering distinct advantages:

- A byte-level BPE tokenizer trained on GitHub code, which typically yields sequences 33-50% shorter than general-purpose NLP tokenizers (illustrated in the sketch below).
- A compact architecture (6 layers, about 84 million parameters) that is practical to fine-tune on small, high-quality datasets.
- Pre-training from scratch on two million functions, giving it a solid grasp of code syntax and semantics.
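To make the tokenizer advantage concrete, the sketch below (not part of the AlgoDetect codebase) compares token counts from the general-purpose `roberta-base` tokenizer and the public `huggingface/CodeBERTa-small-v1` tokenizer; the choice of these two public checkpoints is an assumption for illustration.

```python
from transformers import AutoTokenizer

# Load a general-purpose NLP tokenizer and the code-specific CodeBERTa tokenizer.
# Checkpoint names are assumptions: public Hugging Face models chosen for illustration.
nl_tokenizer = AutoTokenizer.from_pretrained("roberta-base")
code_tokenizer = AutoTokenizer.from_pretrained("huggingface/CodeBERTa-small-v1")

snippet = """def add_numbers(num1, num2):
    result = num1 + num2
    return result
"""

# The byte-level BPE vocabulary trained on GitHub code usually
# produces a noticeably shorter sequence for the same snippet.
print("roberta-base tokens:", len(nl_tokenizer.tokenize(snippet)))
print("CodeBERTa tokens:   ", len(code_tokenizer.tokenize(snippet)))
```

Shorter sequences mean more of each snippet fits inside the model's context window and less compute is spent per example.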
```mermaid
sequenceDiagram
    participant Source as Raw Code Input
    participant Parser as AST Generation
    participant Refiner as Structure Refinement
    participant Tokenizer as CodeBERTa Tokenizer
    participant Encoder as Transformer Encoder
    participant Classifier as Prediction Module
    Source ->> Parser: Code snippet for analysis
    Parser ->> Refiner: Generated AST
    Refiner ->> Tokenizer: Refined AST tokens
    Tokenizer ->> Encoder: Tokenized input for model
    Encoder ->> Classifier: Embeddings for classification
    Classifier -->> Source: Predicted Code Structure
```
```mermaid
graph TD;
    A[Raw Code Snippet] --> B(AST Creation Module)
    B --> C(AST Pruning & Annotation)
    C --> D[Flattened/Annotated AST Tokens]
    D --> E[CodeBERTa Tokenizer Input]
    E --> F["CodeBERTa Transformer Model (Encoder)"]
    F --> G[Algorithmic Structure Classifier Output]
```
Within this pipeline, the prediction module loads CodeBERTa through Hugging Face's `AutoModelForSequenceClassification`, which attaches a classification head on top of the Transformer encoder.
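Below is a minimal sketch of how such a classifier could be instantiated and queried, assuming the public `huggingface/CodeBERTa-small-v1` checkpoint as the encoder. The label set and example snippet are illustrative assumptions, and the freshly added classification head would need fine-tuning on the labelled dataset before its predictions mean anything.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed base checkpoint and label set, for illustration only.
BASE_MODEL = "huggingface/CodeBERTa-small-v1"
LABELS = ["FibonacciSequence", "DijkstraAlgorithm", "BreadthFirstSearch"]

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForSequenceClassification.from_pretrained(
    BASE_MODEL,
    num_labels=len(LABELS),
    id2label={i: name for i, name in enumerate(LABELS)},
    label2id={name: i for i, name in enumerate(LABELS)},
)
model.eval()

# Note: the classification head above is randomly initialized; without
# fine-tuning on labelled examples the prediction below is arbitrary.
snippet = "def fib(n):\n    return n if n < 2 else fib(n - 1) + fib(n - 2)"
inputs = tokenizer(snippet, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

predicted_id = int(logits.argmax(dim=-1))
print(model.config.id2label[predicted_id])
```

As a running example of the AST side of the pipeline, consider this simple Python function: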
```python
def add_numbers(num1, num2):
    result = num1 + num2
    return result
```
Resulting AST Representation:
```
Module
└── FunctionDef (add_numbers)
    ├── arguments (num1, num2)
    ├── Assign (result)
    │   └── BinOp (Add)
    └── Return (result)
```
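The tree above can be produced, pruned, and flattened with Python's built-in `ast` module. The sketch below is one plausible traversal, not the exact logic of `ast_parser.py`; which node types get annotated with identifiers is an assumption made for illustration.

```python
import ast

SOURCE = """
def add_numbers(num1, num2):
    result = num1 + num2
    return result
"""

def flatten_ast(node: ast.AST) -> list[str]:
    """Pre-order walk that turns an AST into a flat sequence of node-type tokens."""
    tokens = [type(node).__name__]
    # Annotate a few node types with their identifiers to retain useful detail.
    if isinstance(node, ast.FunctionDef):
        tokens.append(node.name)
    elif isinstance(node, ast.Name):
        tokens.append(node.id)
    for child in ast.iter_child_nodes(node):
        tokens.extend(flatten_ast(child))
    return tokens

tree = ast.parse(SOURCE)
print(" ".join(flatten_ast(tree)))
# e.g. "Module FunctionDef add_numbers arguments arg arg Assign Name result Store BinOp ..."
```

A flat sequence along these lines is what the pipeline then hands to the CodeBERTa tokenizer.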
| Code Sample | Assigned Label |
|---|---|
| Recursive Fibonacci computation | FibonacciSequence |
| Dijkstra's with while loop | DijkstraAlgorithm |
| BFS using HashMap | BreadthFirstSearch |
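Turning rows like these into model inputs takes only a little glue code. The sketch below assumes `AlgosVersion2.csv` has `code` and `label` columns, which is an assumption about the dataset's layout rather than a documented fact.

```python
import pandas as pd
from transformers import AutoTokenizer

# Assumed layout: one column of source snippets and one of algorithm labels.
df = pd.read_csv("dataset/AlgosVersion2.csv")  # columns assumed: "code", "label"

labels = sorted(df["label"].unique())
label2id = {name: i for i, name in enumerate(labels)}

tokenizer = AutoTokenizer.from_pretrained("huggingface/CodeBERTa-small-v1")

# Tokenize every snippet and pair it with its integer class id.
encodings = tokenizer(
    df["code"].tolist(),
    truncation=True,
    padding=True,
    return_tensors="pt",
)
targets = [label2id[name] for name in df["label"]]
print(f"{len(targets)} examples, {len(labels)} classes: {labels}")
```

The resulting encodings and integer targets can then be fed to a standard fine-tuning loop, for example Hugging Face's `Trainer`.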
```
AlgoD-CodeStructure-Identifier/
├── README.md
├── dataset/
│   └── AlgosVersion2.csv
├── notebook/
│   └── dissertation.ipynb
└── code/
    ├── ast_parser.py
    └── model_inference.py
```