BMNet Deepfake Detection Model with BiLSTM and Multi-Head Attention

Category: Web Application

Price: ₹ 5000 ₹ 10000 50% OFF

Abstract
The rapid advancement of deepfake generation techniques has raised serious concerns regarding the authenticity of digital media, particularly images and videos shared on online platforms. Deepfake content often exhibits subtle spatial and temporal inconsistencies that are difficult to detect using traditional image-based methods. To address this challenge, this project proposes a deepfake detection system based on a hybrid Convolutional Neural Network (CNN) and Bidirectional Long Short-Term Memory (BiLSTM) architecture. The system processes fixed-length sequences of facial frames to effectively capture both spatial features and temporal dependencies. A pretrained ResNet-18 model is employed as a feature extractor, while the BiLSTM network models frame-to-frame variations to improve detection accuracy. The proposed approach supports both image and video inputs by converting them into frame sequences of uniform length. The model is trained using a binary cross-entropy loss function and evaluated using standard performance metrics such as accuracy, precision, recall, and F1-score. Additionally, the system is deployed through a desktop-based graphical user interface and a web-based application with secure user authentication, enabling real-time deepfake detection. Experimental results demonstrate that the proposed CNN–BiLSTM framework provides reliable and efficient deepfake classification, making it suitable for practical forensic and security applications.
Keywords
Deepfake Detection, Convolutional Neural Network, Bidirectional LSTM, ResNet-18, Video Forensics, Image Classification, Deep Learning, Flask Deployment

Introduction
The rapid evolution of artificial intelligence and deep learning technologies has significantly transformed the way digital media is created, shared, and consumed. Images and videos have become primary sources of information across social media platforms, news channels, entertainment industries, and forensic investigations. However, this technological progress has also given rise to serious security and ethical challenges. One of the most concerning developments is the emergence of deepfake technology, which enables the creation of highly realistic but manipulated images and videos using advanced machine learning models. Deepfake media can convincingly alter facial expressions, speech, and actions of individuals, making it extremely difficult for humans to distinguish between genuine and forged content.
Deepfakes are typically generated using deep learning architectures such as Generative Adversarial Networks (GANs) and autoencoders, which learn complex facial representations from large datasets. While these techniques have legitimate applications in fields such as film production and virtual reality, they are increasingly misused for malicious purposes. Deepfake videos have been exploited for spreading misinformation, political manipulation, identity theft, fraud, and defamation. The rapid spread of such content poses a serious threat to digital trust, public safety, and personal privacy. As a result, developing reliable and automated deepfake detection systems has become an important research challenge in the domains of computer vision and cybersecurity.
Traditional methods for image and video forgery detection rely on handcrafted features and statistical analysis. These approaches often fail to generalize well when faced with high-quality deepfake content, as modern manipulation techniques are capable of removing obvious visual artifacts. Moreover, many existing detection systems analyze individual frames independently, ignoring the temporal relationships between consecutive frames in a video. Since deepfake videos often contain subtle inconsistencies in facial motion, blinking patterns, or temporal transitions, frame-based methods alone are insufficient for robust detection.
To overcome these limitations, deep learning-based approaches have gained significant attention in recent years. Convolutional Neural Networks (CNNs) are widely used for extracting spatial features from images due to their ability to learn hierarchical representations of textures, edges, and facial structures. CNN-based models have demonstrated strong performance in image classification and face analysis tasks. However, CNNs process each frame independently and do not capture temporal dependencies present in video data. This limitation reduces their effectiveness in detecting deepfakes that exhibit inconsistencies across time rather than within a single frame.
Temporal modeling techniques, such as Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks, are designed to process sequential data and learn dependencies over time. In the context of deepfake detection, LSTM-based models can analyze frame-to-frame variations and detect unnatural motion patterns introduced during the manipulation process. Bidirectional LSTM (BiLSTM) networks further enhance this capability by processing sequences in both forward and backward directions, allowing the model to capture past and future contextual information simultaneously. This bidirectional analysis is particularly useful for identifying subtle temporal artifacts that are not apparent in individual frames.
In this project, a hybrid CNN–BiLSTM architecture is proposed for effective deepfake detection in both images and videos. The CNN component is responsible for extracting high-level spatial features from each frame, while the BiLSTM component models the temporal relationships between consecutive frames. By combining spatial and temporal learning, the proposed system is capable of detecting both visual artifacts and motion inconsistencies commonly found in deepfake content. A pretrained ResNet-18 model is employed as the CNN feature extractor to leverage transfer learning and improve feature representation with limited training data.
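To make the architecture concrete, the following is a minimal PyTorch sketch of such a hybrid model. The class name `DeepfakeCNNBiLSTM`, the hidden size, and the choice of classifying from the final time step are illustrative assumptions rather than the exact configuration shipped with the project; a recent torchvision (with the `weights` API) is assumed.

```python
import torch
import torch.nn as nn
from torchvision import models

class DeepfakeCNNBiLSTM(nn.Module):
    """Sketch: per-frame ResNet-18 features, BiLSTM over the frame sequence."""
    def __init__(self, hidden_dim=256):
        super().__init__()
        resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        # Drop the final classification layer; keep the 512-d feature extractor.
        self.cnn = nn.Sequential(*list(resnet.children())[:-1])
        self.bilstm = nn.LSTM(input_size=512, hidden_size=hidden_dim,
                              batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_dim * 2, 1)  # 2x for the two directions

    def forward(self, x):
        # x: (batch, seq_len, 3, 224, 224)
        b, t, c, h, w = x.shape
        feats = self.cnn(x.view(b * t, c, h, w)).view(b, t, -1)  # (b, t, 512)
        out, _ = self.bilstm(feats)              # (b, t, 2 * hidden_dim)
        logits = self.fc(out[:, -1, :])          # classify from last time step
        return torch.sigmoid(logits).squeeze(1)  # probability of "Fake"
```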
The system is designed to handle both image and video inputs in a unified manner. For image-based detection, a single image is replicated to form a fixed-length frame sequence, ensuring compatibility with the temporal model. For video-based detection, multiple frames are extracted from the input video and grouped into sequences of predefined length. All frames undergo preprocessing steps such as resizing, normalization, and tensor conversion to ensure consistent input dimensions. The extracted CNN features are then passed to the BiLSTM network, followed by a fully connected layer and sigmoid activation function to perform binary classification as either Real or Fake.
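A hedged sketch of this unified preprocessing pipeline is shown below; the sequence length of 16 and the ImageNet normalization statistics are assumptions chosen for illustration.

```python
import cv2
import torch
from PIL import Image
from torchvision import transforms

SEQ_LEN = 16  # assumed fixed sequence length; the real value may differ

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

def image_to_sequence(path):
    """Replicate a single image SEQ_LEN times to form a frame sequence."""
    frame = transform(Image.open(path).convert("RGB"))
    return frame.unsqueeze(0).repeat(SEQ_LEN, 1, 1, 1)  # (SEQ_LEN, 3, 224, 224)

def video_to_sequence(path):
    """Sample SEQ_LEN evenly spaced frames from a video file."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(SEQ_LEN):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * total / SEQ_LEN))
        ok, frame = cap.read()
        if not ok:
            break
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        frames.append(transform(Image.fromarray(rgb)))
    cap.release()
    if not frames:
        raise ValueError("No readable frames in video")
    while len(frames) < SEQ_LEN:            # pad short videos by repetition
        frames.append(frames[-1].clone())
    return torch.stack(frames)              # (SEQ_LEN, 3, 224, 224)
```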
In addition to model development, this project emphasizes practical usability through system deployment. The trained deepfake detection model is integrated into both a desktop-based graphical user interface and a web-based application developed using the Flask framework. The desktop interface allows offline testing of images and videos, while the web application provides secure user authentication, real-time prediction, and video streaming capabilities. The use of password hashing and session management ensures basic security for user access. This dual deployment approach demonstrates the feasibility of applying deep learning-based deepfake detection systems in real-world scenarios.

Objectives
The primary objective of this project is to design and develop an effective deepfake detection system capable of accurately classifying manipulated and authentic images and videos. The project aims to address the limitations of traditional frame-based detection methods by incorporating both spatial and temporal feature analysis using deep learning techniques.
The specific objectives of the project are as follows:
1. To study and analyze deepfake generation techniques
To understand the working principles of modern deepfake generation methods such as GANs and autoencoders, and to identify the common spatial and temporal artifacts introduced during the manipulation process.
2. To design a hybrid CNN–BiLSTM deep learning architecture
To develop a model that combines the strengths of Convolutional Neural Networks for spatial feature extraction and Bidirectional Long Short-Term Memory networks for temporal sequence modeling, enabling effective detection of deepfake content.
3. To extract and preprocess facial frame sequences from images and videos
To implement a preprocessing pipeline that converts both images and videos into fixed-length frame sequences, ensuring consistent input dimensions for temporal modeling.
4. To capture spatial facial features using a pretrained CNN model
To utilize a pretrained ResNet-18 network for extracting high-level spatial features such as facial textures, edges, and manipulation artifacts from individual frames.
5. To model temporal inconsistencies using BiLSTM networks
To analyze frame-to-frame variations and temporal dependencies in video sequences in order to detect unnatural facial movements and transitions commonly present in deepfake videos.
6. To train and evaluate the proposed model using standard performance metrics
To assess the effectiveness of the deepfake detection system using metrics such as accuracy, precision, recall, and F1-score.
7. To support both image-based and video-based deepfake detection
To ensure that the system can process single images as well as video inputs by converting them into compatible frame sequences.
8. To develop a user-friendly testing and deployment environment
To implement both a desktop-based graphical user interface and a web-based application that allow users to upload images or videos and obtain real-time deepfake detection results.
9. To ensure secure access and practical usability of the system
To integrate user authentication, session management, and password hashing in the web application to enhance security and prevent unauthorized access.
10. To provide a scalable and deployable deepfake detection solution
To design the system in a modular and efficient manner so that it can be extended in the future to support larger datasets, advanced models, and real-world forensic applications.

[Block diagram: overall architecture of the proposed CNN–BiLSTM deepfake detection system]

• Demo Video
• Complete project
• Full project report
• Source code
• Complete online project support
• Lifetime access
• Execution Guidelines
• Immediate download

Requirement Specification
Software Requirements
The proposed deepfake detection system is a comprehensive application that integrates deep learning, computer vision, graphical user interfaces, and web-based deployment. To support these functionalities, a robust software environment is required. The software requirements include programming languages, deep learning frameworks, supporting libraries, development tools, and deployment technologies. Each component plays a crucial role in ensuring efficient data processing, accurate model training, reliable testing, and secure user interaction. The selection of these software tools is based on performance, compatibility, scalability, and ease of integration.

1. Python Programming Language
Python is used as the primary programming language for the implementation of the entire deepfake detection system. The choice of Python is motivated by its simplicity, readability, and extensive support for artificial intelligence and machine learning applications. Python allows developers to implement complex deep learning architectures such as Convolutional Neural Networks (CNNs) and Bidirectional Long Short-Term Memory (BiLSTM) networks with minimal code complexity. Its high-level syntax enables faster prototyping and experimentation, which is essential for research-oriented projects.
In addition, Python provides excellent support for modular programming, allowing the project to be divided into multiple functional components such as dataset preparation, preprocessing, model training, evaluation, graphical user interface development, and web deployment. Python’s cross-platform nature ensures that the system can be executed on different operating systems such as Windows, Linux, and macOS without significant modifications. Furthermore, Python has a large developer community and extensive documentation, making it easier to troubleshoot issues and extend the system in the future.

2. PyTorch Deep Learning Framework
PyTorch is employed as the core deep learning framework for designing, training, and evaluating the CNN–BiLSTM model used in deepfake detection. PyTorch is widely preferred in research and academic environments due to its dynamic computation graph, which allows real-time modification of network structures during execution. This feature significantly simplifies debugging and experimentation with different model configurations.
PyTorch supports automatic differentiation, which enables efficient computation of gradients during backpropagation. This is essential for training deep neural networks with high accuracy. The framework also provides native support for GPU acceleration using CUDA, allowing the model to leverage hardware resources for faster training and inference. In this project, PyTorch is used to implement convolutional layers, recurrent BiLSTM layers, activation functions, loss functions such as binary cross-entropy, and optimization algorithms like Adam. PyTorch also supports model serialization, enabling the trained model to be saved and reused during testing and deployment.
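A minimal training-loop sketch under these assumptions is given below; it expects a `model` such as the CNN–BiLSTM above and a `train_loader` yielding `(sequence, label)` batches, and the hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

def train(model, train_loader, epochs=10, lr=1e-4, ckpt="cnn_bilstm.pth"):
    """Train with binary cross-entropy and Adam, then serialize the weights."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    criterion = nn.BCELoss()  # matches the model's sigmoid output
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        model.train()
        running = 0.0
        for sequences, labels in train_loader:
            sequences = sequences.to(device)
            labels = labels.float().to(device)
            optimizer.zero_grad()
            probs = model(sequences)        # probabilities in [0, 1]
            loss = criterion(probs, labels)
            loss.backward()                 # autograd backpropagation
            optimizer.step()
            running += loss.item()
        print(f"epoch {epoch + 1}: mean loss {running / len(train_loader):.4f}")
    torch.save(model.state_dict(), ckpt)    # model serialization
```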

3. Torchvision Library
Torchvision is a specialized library built on top of PyTorch to support computer vision tasks. It provides a wide range of image preprocessing utilities, including resizing, normalization, and tensor conversion, which are essential for preparing input data for deep learning models. In the proposed system, Torchvision ensures that all image and video frames are transformed into a consistent format compatible with the CNN architecture.
Torchvision also provides access to pretrained deep learning models such as ResNet-18, which is used as the CNN feature extractor in this project. The use of pretrained models enables transfer learning, allowing the system to benefit from features learned on large-scale datasets such as ImageNet. This significantly improves feature representation and reduces training time. Torchvision simplifies the integration of pretrained architectures and ensures standardized preprocessing across different stages of the system.
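For example, the pretrained backbone can be loaded through Torchvision and, as one possible transfer-learning choice (an assumption, not necessarily what the released code does), frozen so that only the temporal layers are trained:

```python
import torch.nn as nn
from torchvision import models

resnet = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
backbone = nn.Sequential(*list(resnet.children())[:-1])  # 512-d feature extractor

# Freeze the ImageNet-pretrained weights so training updates only the
# BiLSTM and classifier layers (optional; full fine-tuning also works).
for p in backbone.parameters():
    p.requires_grad = False
```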

4. NumPy Numerical Computing Library
NumPy is used for numerical computation and efficient handling of multidimensional data structures. In deep learning applications, data is often represented in the form of arrays and tensors, and NumPy provides optimized operations for such data. NumPy is used in this project for data manipulation, mathematical computations, and conversion between different data formats.
During model evaluation, NumPy is used to convert PyTorch tensors into arrays for metric computation. It also supports efficient memory management and fast numerical operations, which are essential when dealing with large datasets and feature vectors. NumPy’s reliability and performance make it a fundamental component of the deepfake detection system.
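For instance, converting sigmoid outputs to hard labels for metric computation might look like this (the tensor here is a stand-in for real model outputs):

```python
import numpy as np
import torch

probs = torch.rand(8)  # stand-in for sigmoid outputs from the model
# Detach from the graph, move to CPU, convert to a NumPy array, threshold.
preds = (probs.detach().cpu().numpy() >= 0.5).astype(np.int64)
print(preds)  # array of 0 (Real) / 1 (Fake) labels
```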

5. OpenCV (Computer Vision Library)
OpenCV is used extensively for video processing and frame extraction in the proposed system. It provides powerful tools for reading video files, capturing frames, resizing images, and performing color space conversions. OpenCV enables the system to process video inputs by extracting multiple frames that represent temporal variations in facial expressions and movements.
In addition to frame extraction, OpenCV supports real-time video streaming and playback, which is utilized in both the desktop and web-based interfaces. Its high performance and optimized algorithms make it suitable for real-time computer vision applications. OpenCV plays a critical role in bridging the gap between raw video input and deep learning-based analysis.
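As an illustration, a generator like the following (a common MJPEG-streaming pattern, not necessarily the project's exact code) can feed OpenCV frames to the web interface:

```python
import cv2

def frame_generator(video_path):
    """Yield JPEG-encoded frames, suitable for an MJPEG stream."""
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        ok, jpeg = cv2.imencode(".jpg", frame)
        if ok:
            yield (b"--frame\r\n"
                   b"Content-Type: image/jpeg\r\n\r\n" + jpeg.tobytes() + b"\r\n")
    cap.release()
```

In Flask, such a generator is typically wrapped in a Response with the multipart/x-mixed-replace MIME type so the browser renders the frames as a live stream.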

6. Pillow (PIL) Image Processing Library
The Pillow library is used for image handling and manipulation tasks throughout the project. It supports loading images in various formats, converting color modes, resizing images, and saving image files. Pillow ensures seamless compatibility between image files and deep learning pipelines.
In the desktop-based GUI and web application, Pillow is used to display images and extracted video frames to users. Its simplicity and flexibility make it suitable for both backend processing and frontend visualization tasks. Pillow enhances the overall usability of the system by enabling smooth image handling across different interfaces.
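A small hedged example of the kind of helper involved (the function name and preview size are illustrative):

```python
from PIL import Image

def prepare_preview(path, size=(320, 240)):
    """Open an image, convert it to RGB, and fit it to a preview area."""
    img = Image.open(path).convert("RGB")
    img.thumbnail(size)  # in-place resize that preserves aspect ratio
    return img
```

In a Tkinter GUI, the returned image would then typically be wrapped in PIL.ImageTk.PhotoImage for display in a label or canvas widget.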

7. Scikit-learn Machine Learning Library
Scikit-learn is used for evaluating the performance of the trained deepfake detection model. It provides standardized and widely accepted implementations of evaluation metrics such as accuracy, precision, recall, and F1-score. These metrics are essential for objectively assessing the effectiveness of the proposed CNN–BiLSTM model.
Scikit-learn ensures consistency in performance evaluation and allows easy comparison with existing methods. Its integration with NumPy and PyTorch makes it suitable for post-training analysis and result reporting.
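A brief example of computing these metrics with scikit-learn (the label vectors below are hypothetical placeholders, not experimental data):

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Hypothetical ground-truth labels and model predictions (0 = Real, 1 = Fake).
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
```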

8. Matplotlib Visualization Library
Matplotlib is used for generating graphical representations of training and evaluation results. It enables visualization of training loss curves, accuracy trends, and comparison of evaluation metrics. These visualizations help in analyzing model convergence, detecting overfitting, and understanding overall performance behavior.
Graphs generated using Matplotlib are also included in the project report to support experimental analysis and result discussion. Visualization plays an important role in interpreting deep learning models and presenting findings in an understandable manner.
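A sketch of the kind of plot generated at this stage; the per-epoch numbers below are placeholders, not experimental results:

```python
import matplotlib.pyplot as plt

# Placeholder per-epoch values; real curves come from the training log.
train_loss = [0.68, 0.52, 0.41, 0.33, 0.27, 0.24]
val_acc = [0.61, 0.70, 0.78, 0.83, 0.86, 0.87]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(train_loss, marker="o")
ax1.set(title="Training loss", xlabel="Epoch", ylabel="BCE loss")
ax2.plot(val_acc, marker="o")
ax2.set(title="Validation accuracy", xlabel="Epoch", ylabel="Accuracy")
plt.tight_layout()
plt.savefig("training_curves.png")
```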

9. Tkinter Graphical User Interface Toolkit
Tkinter is used to develop a desktop-based graphical user interface for offline testing of the deepfake detection system. It provides a lightweight and easy-to-use framework for building interactive applications. The GUI allows users to upload images and videos, view extracted frames, and observe real-time classification results.
Tkinter enhances user interaction and makes the system accessible to non-technical users. It serves as an effective demonstration tool during project evaluations and presentations.
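A minimal sketch of such a Tkinter tester is given below; `predict_fn` is a hypothetical callback mapping a file path to a (label, confidence) pair:

```python
import tkinter as tk
from tkinter import filedialog

def build_gui(predict_fn):
    """Minimal offline tester: choose a file, display the Real/Fake verdict."""
    root = tk.Tk()
    root.title("Deepfake Detector")
    result = tk.StringVar(value="Upload an image or video to begin")

    def on_upload():
        path = filedialog.askopenfilename(
            filetypes=[("Media", "*.jpg *.jpeg *.png *.mp4 *.avi")])
        if path:
            label, conf = predict_fn(path)
            result.set(f"{label} ({conf:.1%} confidence)")

    tk.Button(root, text="Upload media", command=on_upload).pack(pady=10)
    tk.Label(root, textvariable=result).pack(pady=10)
    root.mainloop()
```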

10. Flask Web Framework
Flask is used to develop the web-based deployment of the deepfake detection system. It provides a lightweight framework for handling HTTP requests, routing, file uploads, session management, and server-side logic. Flask enables users to interact with the deepfake detection model through a web browser.
The web application includes secure user authentication, prediction interfaces for image and video uploads, and real-time video streaming. Flask’s modular design allows easy integration with deep learning models and databases, making it suitable for scalable and secure deployment.
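The following is a hedged sketch of how such routes might be wired together; `run_model` is a hypothetical inference helper wrapping the trained CNN–BiLSTM, and the user-table schema matches the SQLite sketch in the next subsection.

```python
import sqlite3
from flask import Flask, request, session, redirect
from werkzeug.security import check_password_hash

app = Flask(__name__)
app.secret_key = "change-me"  # required for signed session cookies

def lookup_user(username):
    """Fetch a user row from the SQLite store (schema assumed; see below)."""
    conn = sqlite3.connect("users.db")
    conn.row_factory = sqlite3.Row
    row = conn.execute("SELECT * FROM users WHERE username = ?",
                       (username,)).fetchone()
    conn.close()
    return row

@app.route("/login", methods=["GET", "POST"])
def login():
    if request.method == "GET":
        return ('<form method=post><input name=username>'
                '<input type=password name=password>'
                '<input type=submit value=Login></form>')
    user = lookup_user(request.form["username"])
    if user and check_password_hash(user["password_hash"],
                                    request.form["password"]):
        session["user"] = user["username"]
        return redirect("/predict")
    return "Invalid credentials", 401

@app.route("/predict", methods=["GET", "POST"])
def predict():
    if "user" not in session:
        return redirect("/login")
    if request.method == "POST":
        f = request.files["media"]
        path = f"uploads/{f.filename}"
        f.save(path)
        label = run_model(path)  # hypothetical helper running the CNN-BiLSTM
        return f"Prediction: {label}"
    return ('<form method=post enctype=multipart/form-data>'
            '<input type=file name=media><input type=submit value=Detect></form>')
```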

11. SQLite Database
SQLite is used as the backend database for storing user credentials in the web application. It is a lightweight, serverless database that requires minimal configuration and maintenance. SQLite supports secure storage of user data and integrates seamlessly with Flask.
The database stores user information such as usernames, email addresses, and hashed passwords. SQLite is suitable for small- to medium-scale applications and provides sufficient performance for authentication and session management.
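A sketch of creating such a user table with hashed credentials (the column names are assumptions consistent with the Flask sketch above):

```python
import sqlite3
from werkzeug.security import generate_password_hash

conn = sqlite3.connect("users.db")
conn.execute("""CREATE TABLE IF NOT EXISTS users (
                    id INTEGER PRIMARY KEY AUTOINCREMENT,
                    username TEXT UNIQUE NOT NULL,
                    email TEXT NOT NULL,
                    password_hash TEXT NOT NULL)""")
# Store only a salted hash, never the plain-text password.
conn.execute("INSERT OR IGNORE INTO users (username, email, password_hash) "
             "VALUES (?, ?, ?)",
             ("demo", "demo@example.com", generate_password_hash("demo-password")))
conn.commit()
conn.close()
```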

12. Development Environment and Tools
The project is developed using an integrated development environment such as Visual Studio Code. The IDE provides features such as syntax highlighting, debugging tools, and extension support, which enhance development productivity. It also supports version control and code organization, enabling efficient project management.

Hardware Requirements
The proposed deepfake detection system requires a robust and reliable hardware setup to efficiently support computationally intensive operations such as deep learning model training, video frame processing, real-time inference, and system deployment. A multi-core Central Processing Unit (CPU) is essential for handling general-purpose tasks including data preprocessing, frame extraction from videos, file input/output operations, graphical user interface execution, and web server request handling. Although the primary computational load of deep learning operations is handled by the GPU, the CPU plays a critical role in coordinating system processes and ensuring smooth execution without bottlenecks.

In addition, a Graphics Processing Unit (GPU) with CUDA support is highly recommended to accelerate the training and inference of the CNN–BiLSTM model, as deep learning operations involve large-scale matrix multiplications and parallel computations. The use of a dedicated GPU significantly reduces training time and enables efficient handling of high-resolution images and video sequences.

Adequate Random Access Memory (RAM) is required to store datasets, extracted frame sequences, intermediate feature representations, and model parameters during execution. A minimum of 8 GB RAM is necessary for basic operation, while higher memory capacity ensures smoother multitasking and faster data loading during training and testing phases.

Storage resources are also a crucial hardware requirement, as the system must store raw image and video datasets, preprocessed frame sequences, trained model files, evaluation results, and user-uploaded media files in the deployed application. High-speed storage devices such as solid-state drives improve data access speed and reduce latency during model training and inference.

A display unit with sufficient resolution is required to visualize the graphical user interface, extracted video frames, prediction outputs, and training performance graphs, while standard input devices such as a keyboard and mouse facilitate user interaction and system control. Network connectivity is required for downloading datasets, pretrained models, and software dependencies, as well as for enabling web-based deployment and remote access to the application. The system is designed to operate on a 64-bit operating system that supports modern deep learning frameworks and GPU drivers, ensuring compatibility and stability. Overall, the selected hardware configuration ensures reliable performance, scalability, and smooth execution of the deepfake detection system across training, testing, and deployment environments.

Immediate Download:
1. Synopsis
2. Rough Report
3. Software code
4. Technical support
