AlgoDetect is a conceptual project that demonstrates a method for identifying fundamental algorithmic constructs in unanalyzed source code. The approach combines the structural information captured in a program's Abstract Syntax Tree (AST) with the code understanding of CodeBERTa, a language model pre-trained on source code. The goal is robust classification of code patterns that remains effective even with small, high-quality datasets.
Applying machine learning to source code poses unique challenges. Unlike natural language, code has a complex, hierarchical syntax that goes far beyond a flat sequence of words. Conventional sequence models such as Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs), even though LSTMs mitigate vanishing gradients, largely treat code as flat token streams. This prevents them from capturing long-range or deeply nested dependencies, such as matching parentheses or nested loops. Transformers (such as BERT or GPT) handle longer sequences and parallelize well through self-attention, but they too traditionally operate on linear token flows and frequently miss critical structural information.
To overcome these limitations, AlgoDetect uses CodeBERTa, a RoBERTa-like Transformer pre-trained on large code corpora. CodeBERTa's tokenizer, a byte-level BPE vocabulary trained on GitHub code, tokenizes source code efficiently, typically producing sequences 33-50% shorter than general-purpose NLP tokenizers. The compact CodeBERTa model (6 layers, roughly 84 million parameters), trained from scratch on two million functions, brings a strong grasp of code semantics. By fusing AST-driven preprocessing with CodeBERTa's rich embeddings, our methodology captures both the syntactic and semantic signals needed for accurate code pattern recognition.
Most sequence models used in NLP (RNNs, LSTMs, and generic Transformers) are architected for linear, human language. Code, in contrast, is fundamentally hierarchical and syntactic. This design mismatch leads to several issues with traditional models:

- They consume code as a flat token stream rather than the tree structure it actually has.
- Long-range and deeply nested dependencies, such as matching parentheses or nested loops, are easily lost.
- Structural information that an AST makes explicit is never shown to the model.
CodeBERTa stands out as a RoBERTa-style Transformer pre-trained specifically on programming code, offering distinct advantages:

- A byte-level BPE tokenizer trained on GitHub code, which typically yields sequences 33-50% shorter than general-purpose NLP tokenizers (illustrated in the sketch below).
- A compact architecture (6 layers, about 84 million parameters) that is practical to fine-tune on small, high-quality datasets.
- Pre-training from scratch on two million functions, giving it a solid grasp of code syntax and semantics.
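To make the tokenizer advantage concrete, the sketch below (not part of the AlgoDetect codebase) compares token counts from the general-purpose `roberta-base` tokenizer and the public `huggingface/CodeBERTa-small-v1` tokenizer; the choice of these two public checkpoints is an assumption for illustration.

```python
from transformers import AutoTokenizer

# Load a general-purpose NLP tokenizer and the code-specific CodeBERTa tokenizer.
# Checkpoint names are assumptions: public Hugging Face models chosen for illustration.
nl_tokenizer = AutoTokenizer.from_pretrained("roberta-base")
code_tokenizer = AutoTokenizer.from_pretrained("huggingface/CodeBERTa-small-v1")

snippet = """def add_numbers(num1, num2):
    result = num1 + num2
    return result
"""

# The byte-level BPE vocabulary trained on GitHub code usually
# produces a noticeably shorter sequence for the same snippet.
print("roberta-base tokens:", len(nl_tokenizer.tokenize(snippet)))
print("CodeBERTa tokens:   ", len(code_tokenizer.tokenize(snippet)))
```

Shorter sequences mean more of each snippet fits inside the model's context window and less compute is spent per example.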
```mermaid
sequenceDiagram
    participant Source as Raw Code Input
    participant Parser as AST Generation
    participant Refiner as Structure Refinement
    participant Tokenizer as CodeBERTa Tokenizer
    participant Encoder as Transformer Encoder
    participant Classifier as Prediction Module
    Source ->> Parser: Code snippet for analysis
    Parser ->> Refiner: Generated AST
    Refiner ->> Tokenizer: Refined AST tokens
    Tokenizer ->> Encoder: Tokenized input for model
    Encoder ->> Classifier: Embeddings for classification
    Classifier -->> Source: Predicted Code Structure
```
```mermaid
graph TD;
    A[Raw Code Snippet] --> B(AST Creation Module)
    B --> C(AST Pruning & Annotation)
    C --> D[Flattened/Annotated AST Tokens]
    D --> E[CodeBERTa Tokenizer Input]
    E --> F["CodeBERTa Transformer Model (Encoder)"]
    F --> G[Algorithmic Structure Classifier Output]
```
Within this pipeline, the prediction module loads CodeBERTa through Hugging Face's `AutoModelForSequenceClassification`, which attaches a classification head on top of the Transformer encoder.
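Below is a minimal sketch of how such a classifier could be instantiated and queried, assuming the public `huggingface/CodeBERTa-small-v1` checkpoint as the encoder. The label set and example snippet are illustrative assumptions, and the freshly added classification head would need fine-tuning on the labelled dataset before its predictions mean anything.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed base checkpoint and label set, for illustration only.
BASE_MODEL = "huggingface/CodeBERTa-small-v1"
LABELS = ["FibonacciSequence", "DijkstraAlgorithm", "BreadthFirstSearch"]

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForSequenceClassification.from_pretrained(
    BASE_MODEL,
    num_labels=len(LABELS),
    id2label={i: name for i, name in enumerate(LABELS)},
    label2id={name: i for i, name in enumerate(LABELS)},
)
model.eval()

# Note: the classification head above is randomly initialized; without
# fine-tuning on labelled examples the prediction below is arbitrary.
snippet = "def fib(n):\n    return n if n < 2 else fib(n - 1) + fib(n - 2)"
inputs = tokenizer(snippet, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

predicted_id = int(logits.argmax(dim=-1))
print(model.config.id2label[predicted_id])
```

As a running example of the AST side of the pipeline, consider this simple Python function: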
```python
def add_numbers(num1, num2):
    result = num1 + num2
    return result
```
Resulting AST Representation:
```
Module
└── FunctionDef (add_numbers)
    ├── arguments (num1, num2)
    ├── Assign (result)
    │   └── BinOp (Add)
    └── Return (result)
```
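The tree above can be produced, pruned, and flattened with Python's built-in `ast` module. The sketch below is one plausible traversal, not the exact logic of `ast_parser.py`; which node types get annotated with identifiers is an assumption made for illustration.

```python
import ast

SOURCE = """
def add_numbers(num1, num2):
    result = num1 + num2
    return result
"""

def flatten_ast(node: ast.AST) -> list[str]:
    """Pre-order walk that turns an AST into a flat sequence of node-type tokens."""
    tokens = [type(node).__name__]
    # Annotate a few node types with their identifiers to retain useful detail.
    if isinstance(node, ast.FunctionDef):
        tokens.append(node.name)
    elif isinstance(node, ast.Name):
        tokens.append(node.id)
    for child in ast.iter_child_nodes(node):
        tokens.extend(flatten_ast(child))
    return tokens

tree = ast.parse(SOURCE)
print(" ".join(flatten_ast(tree)))
# e.g. "Module FunctionDef add_numbers arguments arg arg Assign Name result Store BinOp ..."
```

A flat sequence along these lines is what the pipeline then hands to the CodeBERTa tokenizer.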
| Code Sample | Assigned Label |
|---|---|
| Recursive Fibonacci computation | FibonacciSequence |
| Dijkstra's with while loop | DijkstraAlgorithm |
| BFS using HashMap | BreadthFirstSearch |
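Turning rows like these into model inputs takes only a little glue code. The sketch below assumes `AlgosVersion2.csv` has `code` and `label` columns, which is an assumption about the dataset's layout rather than a documented fact.

```python
import pandas as pd
from transformers import AutoTokenizer

# Assumed layout: one column of source snippets and one of algorithm labels.
df = pd.read_csv("dataset/AlgosVersion2.csv")  # columns assumed: "code", "label"

labels = sorted(df["label"].unique())
label2id = {name: i for i, name in enumerate(labels)}

tokenizer = AutoTokenizer.from_pretrained("huggingface/CodeBERTa-small-v1")

# Tokenize every snippet and pair it with its integer class id.
encodings = tokenizer(
    df["code"].tolist(),
    truncation=True,
    padding=True,
    return_tensors="pt",
)
targets = [label2id[name] for name in df["label"]]
print(f"{len(targets)} examples, {len(labels)} classes: {labels}")
```

The resulting encodings and integer targets can then be fed to a standard fine-tuning loop, for example Hugging Face's `Trainer`.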
```
AlgoD-CodeStructure-Identifier/
├── README.md
├── dataset/
│   └── AlgosVersion2.csv
├── notebook/
│   └── dissertation.ipynb
└── code/
    ├── ast_parser.py
    └── model_inference.py
```