Site Crawler Documentation

Overview

The Site Crawler is an ASP.NET Core application designed to crawl websites, gather URLs, and generate sitemaps. It uses HttpClient for making web requests and SignalR for providing real-time updates to users. The crawler is built with scalability, efficiency, and resilience in mind, following best practices for modern web crawling.

Key Features

  • Asynchronous Crawling: Utilizes async/await for non-blocking I/O operations, making it efficient in handling large numbers of web requests.
  • Error Handling and Resilience: Implements retries and fallback mechanisms with the Polly library to gracefully manage transient faults and network issues.
  • Real-Time Feedback: Uses SignalR to provide live updates on crawling progress, enhancing user experience with immediate feedback.
  • Responsible Crawling: Complies with relevant web standards and respects robots.txt directives, keeping the crawler within the rules that site owners publish.

Implementation Details

1. Architecture

The Site Crawler consists of several components (a minimal skeleton of how they fit together follows the list):

  • SiteCrawler Class: The core component responsible for initiating HTTP requests, parsing results, and managing the crawl queue.
  • CrawlDomainController: The MVC controller that handles user interactions and triggers crawling operations.
  • SignalR Hub: Facilitates real-time communication between the server and clients to update users on crawl progress.
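
A minimal skeleton of these components might look like the following, assuming a .NET 6+ web project with implicit usings enabled. The hub class name CrawlHub, the routes, and the method signatures are illustrative assumptions, not the project's actual identifiers.

using Microsoft.AspNetCore.Mvc;
using Microsoft.AspNetCore.SignalR;

// SignalR hub used to push crawl progress to connected clients
// (the name CrawlHub is assumed; it matches the "/crawlHub" route used later).
public class CrawlHub : Hub
{
}

// Core crawler service. HttpClient comes from IHttpClientFactory,
// which AddHttpClient() registers (see Setup and Configuration).
public class SiteCrawler
{
    private readonly HttpClient _httpClient;

    public SiteCrawler(IHttpClientFactory httpClientFactory)
    {
        _httpClient = httpClientFactory.CreateClient();
    }

    // Crawl entry point; a possible body is sketched under "Key Methods" below.
    public Task<IReadOnlyCollection<string>> CrawlAsync(
        Uri startUrl, int maxDepth, int maxPages, CancellationToken ct = default)
        => throw new NotImplementedException();
}

// MVC controller that triggers crawling operations on behalf of the user.
[ApiController]
[Route("crawl-domain")]
public class CrawlDomainController : ControllerBase
{
    private readonly SiteCrawler _crawler;

    public CrawlDomainController(SiteCrawler crawler) => _crawler = crawler;

    [HttpPost("start")]
    public async Task<IActionResult> Start([FromQuery] string url, CancellationToken ct)
    {
        var urls = await _crawler.CrawlAsync(new Uri(url), maxDepth: 3, maxPages: 500, ct);
        return Ok(urls);
    }
}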

2. Key Methods

The SiteCrawler class includes key methods such as the following (a combined sketch appears after the list):

  • CrawlAsync: Initiates the crawling process, manages the queue of URLs, and handles the crawling depth and limits.
  • CrawlPageAsync: Crawls individual pages, handles HTTP responses, and manages error states.
  • GenerateSitemapXml: Generates a sitemap in XML format, adhering to the standard sitemap protocol.
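
The sketch below shows one way these three methods could be implemented inside the SiteCrawler class from the previous section (_httpClient is the field initialized in its constructor). The regex-based link extraction, the limit parameters, and the exact signatures are simplifying assumptions, not the project's real implementation.

using System.Text.RegularExpressions;
using System.Xml.Linq;

// Breadth-first crawl bounded by depth and page limits.
public async Task<IReadOnlyCollection<string>> CrawlAsync(
    Uri startUrl, int maxDepth, int maxPages, CancellationToken ct = default)
{
    var visited = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
    var queue = new Queue<(Uri Url, int Depth)>();
    queue.Enqueue((startUrl, 0));

    while (queue.Count > 0 && visited.Count < maxPages && !ct.IsCancellationRequested)
    {
        var (url, depth) = queue.Dequeue();
        if (depth > maxDepth || !visited.Add(url.AbsoluteUri))
            continue;

        foreach (var link in await CrawlPageAsync(url, ct))
            queue.Enqueue((link, depth + 1));
    }

    return visited;
}

// Fetches a single page and returns the absolute links found on it.
private async Task<IReadOnlyCollection<Uri>> CrawlPageAsync(Uri url, CancellationToken ct)
{
    using var response = await _httpClient.GetAsync(url, ct);
    if (!response.IsSuccessStatusCode)
        return Array.Empty<Uri>();

    var html = await response.Content.ReadAsStringAsync(ct);

    // Naive href extraction for illustration only; a real crawler would use an HTML parser.
    return Regex.Matches(html, @"href\s*=\s*""([^""]+)""", RegexOptions.IgnoreCase)
        .Select(m => Uri.TryCreate(url, m.Groups[1].Value, out var link) ? link : null)
        .Where(link => link is { Scheme: "http" or "https" })
        .Cast<Uri>()
        .ToArray();
}

// Serializes the collected URLs using the standard sitemap protocol (sitemaps.org).
public static string GenerateSitemapXml(IEnumerable<string> urls)
{
    XNamespace ns = "http://www.sitemaps.org/schemas/sitemap/0.9";

    var urlset = new XElement(ns + "urlset",
        urls.Select(u => new XElement(ns + "url",
            new XElement(ns + "loc", u))));

    return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" + Environment.NewLine + urlset;
}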

Best Practices

1. Efficient Crawling

Optimize your crawling by setting sensible limits on depth and the number of pages. This helps prevent overloading your system and respects target websites' bandwidth.
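
One way to keep these limits configurable is to bind an options object from configuration, as in the sketch below. The type name CrawlerOptions, its properties, and the "Crawler" section name are illustrative assumptions.

// Illustrative settings type; bind it from configuration so limits can be tuned per environment.
public sealed class CrawlerOptions
{
    public int MaxDepth { get; set; } = 3;                   // how many link levels to follow from the start URL
    public int MaxPages { get; set; } = 500;                 // hard cap on pages fetched per crawl
    public int DelayBetweenRequestsMs { get; set; } = 250;   // polite pause between requests
}

// In Program.cs:
// builder.Services.Configure<CrawlerOptions>(builder.Configuration.GetSection("Crawler"));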

2. Error Handling

Use comprehensive error handling strategies with detailed logging to diagnose and recover from failures effectively. Implement retries for transient errors using policies like exponential backoff.
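
As a sketch, a Polly retry policy with exponential backoff around the page request could look like this; the retry count and the set of status codes treated as transient are assumptions.

using System.Net;
using Polly;

// Retry transient failures (5xx, 408, or a network exception) with exponential backoff: 2s, 4s, 8s.
var retryPolicy = Policy
    .HandleResult<HttpResponseMessage>(r =>
        (int)r.StatusCode >= 500 || r.StatusCode == HttpStatusCode.RequestTimeout)
    .Or<HttpRequestException>()
    .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

// Usage inside CrawlPageAsync (sketch):
// using var response = await retryPolicy.ExecuteAsync(() => _httpClient.GetAsync(url, ct));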

3. Respect Robots.txt

Always check the robots.txt file of the target website to ensure that your crawler does not access restricted areas, which could lead to legal issues.
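
A minimal robots.txt check could look like the sketch below. GetDisallowedPathsAsync is a hypothetical helper, and the parsing is intentionally simplified; a production crawler should also honor per-agent groups, Allow rules, and crawl delays.

// Naive robots.txt reader (sketch): collects Disallow rules for the wildcard user-agent.
public static async Task<HashSet<string>> GetDisallowedPathsAsync(HttpClient client, Uri site)
{
    var disallowed = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
    using var response = await client.GetAsync(new Uri(site, "/robots.txt"));
    if (!response.IsSuccessStatusCode)
        return disallowed;   // no robots.txt found: nothing is explicitly disallowed

    var text = await response.Content.ReadAsStringAsync();
    var inWildcardGroup = false;

    foreach (var rawLine in text.Split('\n'))
    {
        var line = rawLine.Trim();
        if (line.StartsWith("User-agent:", StringComparison.OrdinalIgnoreCase))
            inWildcardGroup = line.EndsWith("*");
        else if (inWildcardGroup && line.StartsWith("Disallow:", StringComparison.OrdinalIgnoreCase))
        {
            var path = line.Substring("Disallow:".Length).Trim();
            if (path.Length > 0)
                disallowed.Add(path);
        }
    }

    return disallowed;
}

// Before enqueueing a URL (sketch):
// skip it if disallowed.Any(prefix => url.AbsolutePath.StartsWith(prefix, StringComparison.OrdinalIgnoreCase))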

4. Provide Real-Time Feedback

Enhance user experience by utilizing SignalR to send real-time updates on the crawl progress. This keeps users informed and engaged throughout the process.
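
For example, the crawler can broadcast progress through IHubContext after each page is processed. The reporter class, the hub name CrawlHub, and the client event name "CrawlProgress" are assumptions carried over from the earlier sketches.

using Microsoft.AspNetCore.SignalR;

// Small reporter that the crawler calls after each page is processed.
public class CrawlProgressReporter
{
    private readonly IHubContext<CrawlHub> _hub;

    public CrawlProgressReporter(IHubContext<CrawlHub> hub) => _hub = hub;

    // Broadcasts the latest crawl state; clients subscribe to the "CrawlProgress" event.
    public Task ReportAsync(string currentUrl, int pagesCrawled, int pagesQueued) =>
        _hub.Clients.All.SendAsync("CrawlProgress",
            new { currentUrl, pagesCrawled, pagesQueued });
}

On the client side, a SignalR connection to /crawlHub would typically listen for the "CrawlProgress" event and update the page as messages arrive.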

Setup and Configuration

1. Installation

Add the necessary packages to your ASP.NET Core project:

dotnet add package Microsoft.AspNetCore.SignalR
dotnet add package Microsoft.Extensions.Http
dotnet add package Polly

2. Configuration

Configure HttpClient, SignalR, and the crawler service in Program.cs:


var builder = WebApplication.CreateBuilder(args);

// Configure services
builder.Services.AddControllers();          // required for MapControllers below
builder.Services.AddHttpClient();
builder.Services.AddSignalR();
builder.Services.AddScoped<SiteCrawler>();  // register the crawler service

var app = builder.Build();

// Map endpoints
// (CrawlHub is the hub class name assumed in the sketches above; adjust it to match your project)
app.MapHub<CrawlHub>("/crawlHub");
app.MapControllers();

app.Run();