πŸ” AlgoDetect: Precision Code Structure Identification with CodeBERTa

AlgoDetect is a conceptual project demonstrating a method for identifying fundamental algorithmic constructs in unanalyzed source code. The approach combines the structural information in a program's Abstract Syntax Tree (AST) with the learned representations of CodeBERTa, a Transformer language model pre-trained on source code. The goal is robust classification of code patterns that remains effective even on small, high-quality datasets.

## πŸ’‘ Why This Matters: The Code Analysis Challenge

Applying machine learning to source code poses unique hurdles. Unlike natural language, code has a complex, hierarchical syntax that goes far beyond a simple sequence of words. Conventional sequential models such as Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs) largely treat code as flat token streams; even LSTMs, which mitigate the vanishing-gradient problem, struggle to capture long-range or deeply nested dependencies such as matching parentheses or nested loops. Transformers (like BERT or GPT) handle longer sequences and parallelize well through self-attention, but they too traditionally operate on linear token flows and frequently miss critical structural information.

To overcome these limitations, AlgoDetect harnesses CodeBERTa, a RoBERTa-like Transformer pre-trained on large code corpora. Its tokenizer, a byte-level BPE trained on GitHub code, tokenizes source efficiently, often producing sequences 33-50% shorter than those from general-purpose NLP tokenizers. The compact CodeBERTa model (6 layers, 84 million parameters), trained from scratch on two million functions, brings a strong grasp of code semantics. By fusing AST-driven preprocessing with CodeBERTa's rich embeddings, the pipeline captures both the syntactic and semantic signals needed for accurate code pattern recognition.
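
To illustrate the tokenizer difference, the sketch below compares token counts between the public `huggingface/CodeBERTa-small-v1` tokenizer and a general-purpose `roberta-base` tokenizer. A minimal sketch; exact counts depend on the snippet:

```python
from transformers import AutoTokenizer

snippet = "def add_numbers(num1, num2):\n    result = num1 + num2\n    return result\n"

# Byte-level BPE trained on GitHub code vs. a general-purpose NLP tokenizer.
code_tokenizer = AutoTokenizer.from_pretrained("huggingface/CodeBERTa-small-v1")
nlp_tokenizer = AutoTokenizer.from_pretrained("roberta-base")

print("CodeBERTa tokens:  ", len(code_tokenizer.tokenize(snippet)))
print("roberta-base tokens:", len(nlp_tokenizer.tokenize(snippet)))
```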

## 🚫 The Shortcomings of General-Purpose Models in Code Analysis

Most general-purpose models, from RNNs and LSTMs to generic Transformer-based LLMs, are architected for linear human language. Code, in contrast, is fundamentally hierarchical and syntactic. This design mismatch leads to several issues:

- Code is treated as a flat token stream, discarding its tree structure.
- Long-range and deeply nested dependencies, such as matching brackets or nested loops, are easily lost.
- Structural cues that distinguish algorithms, such as recursion and loop nesting, go largely unused.

## ✨ CodeBERTa's Edge: Tailored for Code

CodeBERTa stands out as a RoBERTa-style Transformer pre-trained specifically on programming code, which gives it distinct advantages:

- A code-aware byte-level BPE tokenizer that yields sequences 33-50% shorter than those of general NLP tokenizers.
- A compact architecture (6 layers, 84 million parameters) that remains practical to fine-tune on small datasets.
- Pre-training from scratch on two million functions, giving it a strong prior over code semantics.

## πŸš€ AlgoDetect's Operational Flow

### 🌳 Processing Sequence

```mermaid
sequenceDiagram
    participant Source as Raw Code Input
    participant Parser as AST Generation
    participant Refiner as Structure Refinement
    participant Tokenizer as CodeBERTa Tokenizer
    participant Encoder as Transformer Encoder
    participant Classifier as Prediction Module

    Source ->> Parser: Code snippet for analysis
    Parser ->> Refiner: Generated AST
    Refiner ->> Tokenizer: Refined AST tokens
    Tokenizer ->> Encoder: Tokenized input for model
    Encoder ->> Classifier: Embeddings for classification
    Classifier -->> Source: Predicted Code Structure
```
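
In code, this flow reduces to a few 🤗 Transformers calls. A minimal inference sketch, assuming a fine-tuned checkpoint (the `algodetect-codeberta-finetuned` path is hypothetical) and skipping the AST-refinement step for brevity:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CKPT = "algodetect-codeberta-finetuned"  # hypothetical fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForSequenceClassification.from_pretrained(CKPT)

snippet = "def fib(n):\n    return n if n < 2 else fib(n - 1) + fib(n - 2)"
inputs = tokenizer(snippet, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, num_labels)

predicted_id = logits.argmax(dim=-1).item()
print(model.config.id2label[predicted_id])  # e.g. "FibonacciSequence"
```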

πŸ—οΈ Architectural Overview

```mermaid
graph TD;
    A[Raw Code Snippet] --> B(AST Creation Module)
    B --> C("AST Pruning & Annotation")
    C --> D[Flattened/Annotated AST Tokens]
    D --> E[CodeBERTa Tokenizer Input]
    E --> F["CodeBERTa Transformer Model (Encoder)"]
    F --> G[Algorithmic Structure Classifier Output]
```
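
The pruning and flattening steps (nodes B-D above) can be approximated with Python's standard `ast` module. A minimal sketch; the actual refinement rules in `ast_parser.py` may differ:

```python
import ast

def flatten_ast(source: str) -> list[str]:
    """Pre-order flattening of an AST into node-type tokens."""
    def visit(node: ast.AST) -> list[str]:
        tokens = [type(node).__name__]
        for child in ast.iter_child_nodes(node):
            tokens.extend(visit(child))
        return tokens
    return visit(ast.parse(source))

print(flatten_ast("def f(x):\n    return x + 1"))
# ['Module', 'FunctionDef', 'arguments', 'arg', 'Return',
#  'BinOp', 'Name', 'Load', 'Add', 'Constant']
```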

## πŸ“Š Model Configuration & Training
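
A minimal configuration sketch, attaching a classification head to the public `huggingface/CodeBERTa-small-v1` checkpoint. The label set comes from the dataset snapshot below; the hyperparameters are illustrative, not the project's:

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "huggingface/CodeBERTa-small-v1"  # 6 layers, ~84M parameters
labels = ["FibonacciSequence", "DijkstraAlgorithm", "BreadthFirstSearch"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

training_args = TrainingArguments(
    output_dir="algodetect-codeberta-finetuned",
    learning_rate=2e-5,               # illustrative hyperparameters
    per_device_train_batch_size=16,
    num_train_epochs=3,
)
# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=..., eval_dataset=...)
# trainer.train()
```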

πŸ“ Data Preparation & Preprocessing

### 🌱 Illustrative AST Snippet

```python
def add_numbers(num1, num2):
    result = num1 + num2
    return result
```

Resulting AST Representation:

```
Module
└── FunctionDef (add_numbers)
    β”œβ”€β”€ arguments (num1, num2)
    β”œβ”€β”€ Assign (result)
    β”‚   └── BinOp (Add)
    └── Return (result)
```
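
Python's built-in `ast` module produces this tree directly. A minimal sketch (the `indent` argument requires Python 3.9+):

```python
import ast

source = """
def add_numbers(num1, num2):
    result = num1 + num2
    return result
"""

tree = ast.parse(source)
print(ast.dump(tree, indent=2))  # textual dump of the tree shown above
```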

### πŸ“‹ Dataset Snapshot

| Code Sample | Assigned Label |
| --- | --- |
| Recursive Fibonacci computation | FibonacciSequence |
| Dijkstra's with while loop | DijkstraAlgorithm |
| BFS using HashMap | BreadthFirstSearch |
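
A minimal loading sketch for `dataset/AlgosVersion2.csv`, assuming hypothetical `code` and `label` column names (adjust to the actual CSV header):

```python
import pandas as pd

df = pd.read_csv("dataset/AlgosVersion2.csv")

# Hypothetical column names; adjust to the real header.
label_to_id = {label: i for i, label in enumerate(sorted(df["label"].unique()))}
df["label_id"] = df["label"].map(label_to_id)
print(df[["code", "label", "label_id"]].head())
```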

## πŸ“ˆ Performance & Future Directions

## ✨ Core Tenets of AlgoDetect

## πŸ“‚ Project Directory Layout

```
AlgoD-CodeStructure-Identifier/
β”œβ”€β”€ README.md
β”œβ”€β”€ dataset/
β”‚   └── AlgosVersion2.csv
β”œβ”€β”€ notebook/
β”‚   └── dissertation.ipynb
└── code/
    β”œβ”€β”€ ast_parser.py
    └── model_inference.py
```

πŸ§‘β€πŸ’» Project Lead

Arpan Gupta

πŸŽ“ MSc AI & Robotics, University of Glasgow

πŸ”— GitHub Profile