About the Project
This project is a comprehensive exploration of text summarization techniques using both traditional Natural Language Processing (NLP) and state-of-the-art transformer models. It demonstrates how to extract relevant text from online sources and generate concise summaries using various methods.
It implements both extractive and abstractive summarization techniques and compares their performance.
Project Goals
- To explore and implement both extractive and abstractive summarization techniques.
- To evaluate the performance of different models on real-world web articles.
- To build a user-friendly interface using Streamlit for quick summarization.
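One common way to compare summaries from different models is n-gram overlap against a reference. The project does not name its evaluation metric, so the following ROUGE-1 sketch is an illustrative assumption, not the project's actual scoring code:

```python
from collections import Counter

def rouge1(reference, candidate):
    """Unigram-overlap (ROUGE-1) precision, recall, and F1 between two texts."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # shared word occurrences
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Toy example: 5 of 6 reference words also appear in the candidate.
print(rouge1("the cat sat on the mat", "the cat lay on the mat"))
```

Higher overlap with a human-written reference suggests a better summary, though ROUGE rewards extractive copying and can undervalue good paraphrases from abstractive models.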
Technologies Used
- Python: Programming language used to implement the project.
- NLTK: Used for traditional NLP operations like tokenization and frequency analysis.
- Scikit-learn: Used for text vectorization (TF-IDF) in extractive summarization.
- Transformers: Hugging Face library providing pre-trained models such as BART and T5 for abstractive summarization.
- BeautifulSoup: Used for scraping and extracting web article content.
- Streamlit: Front-end interface to input URLs and get summaries.
Approach
Extractive Summarization
- Scraped article content using BeautifulSoup.
- Cleaned and tokenized text using NLTK.
- Generated sentence scores using TF-IDF and selected top-n sentences as summary.
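The extractive pipeline above can be sketched end to end. The inline sample HTML is a stand-in for a scraped article, and the regex sentence splitter replaces NLTK's sent_tokenize (which needs a data download) purely to keep the example self-contained:

```python
import re
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical article HTML standing in for a scraped web page.
html = """
<html><body>
<p>Solar power capacity grew rapidly last year. Analysts attribute the growth
to falling panel prices. Government incentives also played a role.</p>
<p>Critics note that storage remains a bottleneck. Battery costs, however,
continue to decline.</p>
</body></html>
"""

# 1. Extract paragraph text with BeautifulSoup.
text = " ".join(p.get_text(" ", strip=True)
                for p in BeautifulSoup(html, "html.parser").find_all("p"))

# 2. Split into sentences (simple regex stand-in for NLTK's sent_tokenize).
sentences = re.split(r"(?<=[.!?])\s+", text)

# 3. Score each sentence by the mean TF-IDF weight of its terms.
tfidf = TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(sentences)
scores = matrix.mean(axis=1).A1  # .A1 flattens the per-sentence row means

# 4. Keep the top-n sentences, restored to original order, as the summary.
top_n = 2
ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:top_n]
summary = " ".join(sentences[i] for i in sorted(ranked))
print(summary)
```

Restoring the chosen sentences to their original order (step 4) keeps the summary coherent even though selection is driven purely by score.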
Abstractive Summarization
- Used HuggingFace transformer models like BART and T5.
- Fine-tuned the models on sample data to improve summary quality.
- Generated human-like, concise summaries with contextual understanding.
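The abstractive step might look like the sketch below. The checkpoint name facebook/bart-large-cnn is the stock summarization model, not necessarily the fine-tuned checkpoint the project used, and the transformers import is deferred inside the function because the model weights are large:

```python
def summarize(text, model_name="facebook/bart-large-cnn", max_length=60):
    """Generate an abstractive summary with a Hugging Face summarization pipeline.

    model_name defaults to the stock BART checkpoint; a fine-tuned
    checkpoint path could be passed instead.
    """
    # Imported lazily so the module loads without downloading model weights.
    from transformers import pipeline

    summarizer = pipeline("summarization", model=model_name)
    result = summarizer(text, max_length=max_length, min_length=10, do_sample=False)
    return result[0]["summary_text"]

# Example usage (downloads the model on first run):
# print(summarize(long_article_text))
```

Unlike the extractive approach, the model generates new sentences rather than selecting existing ones, which is what gives abstractive summaries their more natural, human-like phrasing.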
Results & Insights
- Extractive summaries tend to be longer but remain faithful to the source text.
- Abstractive models produced more readable summaries, closer to human writing style.
- The Streamlit interface was intuitive and fast, enabling quick summarization of web articles.