🔍 Built a Python Plagiarism Detection Tool – Combining AST Analysis & TF-IDF

Hey r/Python! 👋

Just finished my first major Python project and wanted to share it with the community that taught me so much!

What it does:

A command-line tool that detects code similarities using two complementary approaches:

- AST (Abstract Syntax Tree) analysis – Compares code structure
- TF-IDF vectorization – Analyzes textual patterns
- Configurable weighting system – Fine-tune detection sensitivity
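To give a feel for the structural half, here is a minimal sketch of one way to compare code structure with the standard library only. It is not the repo's actual implementation: the helper names (`node_types`, `ast_similarity`, `combined_score`) are hypothetical, comparing node-type sequences with `difflib` is just one simple option, and treating the weight as a plain weighted average is an assumption.

```python
from __future__ import annotations

import ast
from difflib import SequenceMatcher


def node_types(source: str) -> list[str]:
    """Flatten a module into the sequence of its AST node type names."""
    tree = ast.parse(source)
    # Identifiers live in node attributes, not node types, so renaming
    # variables or functions does not change this sequence.
    return [type(node).__name__ for node in ast.walk(tree)]


def ast_similarity(source_a: str, source_b: str) -> float:
    """Ratio in [0, 1] describing how closely two node-type sequences match."""
    matcher = SequenceMatcher(None, node_types(source_a), node_types(source_b))
    return matcher.ratio()


def combined_score(ast_score: float, text_score: float, ast_weight: float = 0.8) -> float:
    """Weighted average of both scores (assumed meaning of the weighting)."""
    return ast_weight * ast_score + (1.0 - ast_weight) * text_score


original = "def add(a, b):\n    return a + b\n"
renamed = "def total(x, y):\n    return x + y\n"

# Same structure, different names: the structural score is unaffected.
print(ast_similarity(original, renamed))  # → 1.0
```

This is why the two approaches complement each other: renaming identifiers fools a purely textual comparison but leaves the syntax tree shape untouched.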

Why I built this:

This started as a learning project to dive deeper into Python’s ast module and NLP techniques. Along the way I realized it could be genuinely useful for educators and code reviewers.

Target audience:

- Students & Teachers – Detect academic plagiarism in programming assignments
- Code reviewers – Identify duplicate code during reviews
- Quality assurance teams – Find redundant implementations
- Solo developers – Clean up personal projects and refactor similar functions
- Educational institutions – Automated plagiarism checking for coding courses

Scope & Limitations

- Compares code against a provided dataset only
- Not a replacement for professional plagiarism detection services
- Best suited for educational purposes or small-scale analysis
- Requires manual curation of the comparison dataset

Simple usage

python main.py examples/test_code/

Advanced configuration

python main.py code/ --threshold 0.3 --ast-weight 0.8 --debug
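For anyone curious how flags like these are typically wired up, here is a rough argparse sketch. The three flag names come from the command above; the positional `path` name, the default values, and the help strings are assumptions for illustration and may not match what `main.py` actually does.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags shown in the post (defaults are assumed)."""
    parser = argparse.ArgumentParser(description="Detect similar Python files.")
    parser.add_argument("path", help="directory containing the files to check")
    parser.add_argument("--threshold", type=float, default=0.5,
                        help="minimum similarity to report (assumed default)")
    parser.add_argument("--ast-weight", type=float, default=0.5,
                        help="weight of the AST score vs. TF-IDF (assumed default)")
    parser.add_argument("--debug", action="store_true",
                        help="print intermediate scores")
    return parser


# Parse the same arguments as the "Advanced configuration" command.
args = build_parser().parse_args(
    ["code/", "--threshold", "0.3", "--ast-weight", "0.8", "--debug"]
)
print(args.path, args.threshold, args.ast_weight, args.debug)
```

Note that argparse turns `--ast-weight` into the attribute `args.ast_weight`.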

- Detailed confidence scoring and risk categorization
- Adjustable similarity thresholds
- Debug mode for algorithm insights
- Batch processing multiple files
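Risk categorization on top of a similarity score can be as simple as a couple of cut-offs. The sketch below is only an illustration: the function name `risk_level`, the labels, and the 0.7 upper cut-off are hypothetical, and the lower cut-off reuses the 0.3 threshold from the example command.

```python
def risk_level(score: float, threshold: float = 0.3, high: float = 0.7) -> str:
    """Map a similarity score in [0, 1] to a coarse label (illustrative cut-offs)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be within [0, 1], got {score}")
    if score < threshold:
        return "LOW"      # below the reporting threshold
    if score < high:
        return "MEDIUM"   # worth a manual look
    return "HIGH"         # strong structural and/or textual overlap


for score in (0.1, 0.5, 0.9):
    print(score, risk_level(score))
```

Lowering `threshold` flags more pairs for review, which is the trade-off an adjustable threshold exposes.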

Technical highlights:

- Uses Python’s ast module for syntax tree parsing
- Scikit-learn for TF-IDF vectorization and cosine similarity
- Clean CLI with argparse and colored output
- Modular architecture – easy to extend with new detection methods
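To show what the TF-IDF plus cosine similarity step computes, here is a dependency-free stand-in written with the standard library. It is not the scikit-learn code the tool uses (there the equivalent calls would presumably be `TfidfVectorizer` and `cosine_similarity`); the tokenizer regex and the helper names are assumptions, and the IDF formula follows scikit-learn's default smoothed variant.

```python
from __future__ import annotations

import math
import re
from collections import Counter


def tokens(source: str) -> list[str]:
    """Split source text into identifiers, numbers, and single symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", source)


def tfidf_vectors(documents: list[str]) -> list[dict[str, float]]:
    """Return one sparse TF-IDF vector (term -> weight) per document."""
    counts = [Counter(tokens(doc)) for doc in documents]
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for count in counts for term in count)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in count.items()} for count in counts]


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


corpus = [
    "def add(a, b):\n    return a + b",
    "def sub(a, b):\n    return a - b",
    "print 42",
]
vectors = tfidf_vectors(corpus)
print(cosine(vectors[0], vectors[1]))  # near-duplicates score high
print(cosine(vectors[0], vectors[2]))  # no shared tokens, so 0.0
```

Unlike the AST comparison, this score drops when identifiers are renamed, so it mostly catches copy-paste with light edits.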

How it compares

| Feature | This Tool | Online Plagiarism Checkers | IDE Extensions |
|---|---|---|---|
| Privacy | ✅ Fully local | ❌ Upload required | ✅ Local |
| Speed | ✅ Fast | ❌ Slow (web-based) | ✅ Fast |
| Code-specific | ✅ Built for code | ❌ General text tools | ✅ Code-aware |
| Batch processing | ✅ Multiple files | ❌ Usually single files | ❌ Limited |
| Free | ✅ Open source | 💰 Often paid | 💰 Mixed |
| Customizable | ✅ Easy to modify | ❌ Black box | ❌ Limited |

GitHub: https://github.com/rayan-alahiane/plagiarism-detector-py

submitted by /u/Gold-Part2605 to r/Python