```mermaid
graph LR
    Crawl_Orchestrator["Crawl Orchestrator"]
    Configuration_Initialization_Manager["Configuration & Initialization Manager"]
    Crawl_State_Coverage_Manager["Crawl State & Coverage Manager"]
    Confidence_Quality_Assessor["Confidence & Quality Assessor"]
    Link_Ranking_Selection_Engine["Link Ranking & Selection Engine"]
    Semantic_Analysis_Unit["Semantic Analysis Unit"]
    Knowledge_Base_Persistence["Knowledge Base Persistence"]
    Reporting_Statistics_Generator["Reporting & Statistics Generator"]
    Configuration_Initialization_Manager -- "initializes" --> Crawl_Orchestrator
    Configuration_Initialization_Manager -- "configures" --> Semantic_Analysis_Unit
    Configuration_Initialization_Manager -- "configures" --> Link_Ranking_Selection_Engine
    Crawl_Orchestrator -- "updates state in" --> Crawl_State_Coverage_Manager
    Crawl_Orchestrator -- "retrieves coverage from" --> Crawl_State_Coverage_Manager
    Crawl_Orchestrator -- "requests quality confidence from" --> Confidence_Quality_Assessor
    Crawl_Orchestrator -- "requests ranked links from" --> Link_Ranking_Selection_Engine
    Crawl_Orchestrator -- "utilizes" --> Semantic_Analysis_Unit
    Crawl_Orchestrator -- "manages persistence via" --> Knowledge_Base_Persistence
    Crawl_Orchestrator -- "provides data to" --> Reporting_Statistics_Generator
    Crawl_State_Coverage_Manager -- "provides content data to" --> Confidence_Quality_Assessor
    Crawl_State_Coverage_Manager -- "provides content data to" --> Link_Ranking_Selection_Engine
    Confidence_Quality_Assessor -- "provides quality metrics to" --> Reporting_Statistics_Generator
    Link_Ranking_Selection_Engine -- "requests semantic analysis from" --> Semantic_Analysis_Unit
```

Details

The AdaptiveCrawler subsystem is designed around a central Crawl Orchestrator that manages an iterative, adaptive crawling process. The Configuration & Initialization Manager sets up the system, including the Crawl Orchestrator and its key dependencies. During execution, the Crawl Orchestrator continuously interacts with the Crawl State & Coverage Manager to update and retrieve the current crawl state and coverage metrics. For intelligent link prioritization, the Crawl Orchestrator requests ranked links from the Link Ranking & Selection Engine, which in turn leverages the Crawl State & Coverage Manager for content data and the Semantic Analysis Unit for deep semantic understanding. The Semantic Analysis Unit is also directly utilized by the Crawl Orchestrator for mapping queries. To assess the quality and completeness of the crawl, the Crawl Orchestrator queries the Confidence & Quality Assessor, which relies on the Crawl State & Coverage Manager for raw content data. All crawl results and learned knowledge are managed for persistence by the Knowledge Base Persistence component, orchestrated by the Crawl Orchestrator. Finally, the Reporting & Statistics Generator provides insights into the crawl's performance, receiving data from both the Crawl Orchestrator and the Confidence & Quality Assessor.

Crawl Orchestrator

Orchestrates the entire adaptive crawling process, managing the flow between different stages, including batch processing, state updates, and stopping conditions. This acts as the central control for the adaptive loop.

Related Classes/Methods:
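The adaptive loop described above (batch processing, state updates, stopping conditions) can be sketched roughly as follows. All names here (`CrawlState`, `orchestrate`, the `fetch`/`rank`/`confidence` callables) are illustrative stand-ins, not the subsystem's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class CrawlState:
    """Hypothetical minimal crawl state: visited URLs and crawled content."""
    visited: set = field(default_factory=set)
    documents: list = field(default_factory=list)

def orchestrate(seed_urls, fetch, rank, confidence,
                threshold=0.8, max_batches=10, batch_size=5):
    """Adaptive loop sketch: fetch a ranked batch, update state, and stop
    once the confidence score clears the threshold or batches run out."""
    state = CrawlState()
    frontier = list(seed_urls)
    for _ in range(max_batches):
        batch = [u for u in rank(frontier, state)
                 if u not in state.visited][:batch_size]
        if not batch:
            break  # frontier exhausted
        for url in batch:
            state.visited.add(url)
            doc, links = fetch(url)        # fetch returns (content, outlinks)
            state.documents.append(doc)
            frontier.extend(links)
        if confidence(state) >= threshold:
            break  # stopping condition met
    return state
```

The orchestrator itself stays agnostic about how ranking and confidence work; those are injected, which mirrors the component boundaries in the diagram.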

Configuration & Initialization Manager

Manages the setup and validation of adaptive strategies for the crawler.

Related Classes/Methods:
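Setup and validation of an adaptive strategy might look like the following sketch. The field names and the `statistical`/`embedding` strategy labels are assumptions for illustration, not the component's real configuration schema:

```python
from dataclasses import dataclass

VALID_STRATEGIES = {"statistical", "embedding"}  # assumed strategy names

@dataclass
class AdaptiveConfig:
    strategy: str = "statistical"
    confidence_threshold: float = 0.8
    max_pages: int = 50
    top_k_links: int = 5

    def validate(self):
        """Fail fast on invalid settings before the orchestrator starts."""
        if self.strategy not in VALID_STRATEGIES:
            raise ValueError(f"unknown strategy: {self.strategy!r}")
        if not 0.0 < self.confidence_threshold <= 1.0:
            raise ValueError("confidence_threshold must be in (0, 1]")
        if self.max_pages <= 0 or self.top_k_links <= 0:
            raise ValueError("max_pages and top_k_links must be positive")
        return self
```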

Crawl State & Coverage Manager

Updates and maintains the crawler's internal state, including tokenizing crawled content and computing the coverage achieved so far.

Related Classes/Methods:
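A minimal sketch of state tracking, assuming coverage is measured as the fraction of query terms observed in the crawled corpus (the real component may use a more sophisticated metric):

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase alphanumeric tokenization (illustrative)."""
    return re.findall(r"[a-z0-9]+", text.lower())

class CrawlStateManager:
    def __init__(self):
        self.term_counts = Counter()

    def update(self, content):
        """Fold a newly crawled page's terms into the state."""
        self.term_counts.update(tokenize(content))

    def coverage(self, query):
        """Fraction of query terms seen so far in the crawled corpus."""
        terms = set(tokenize(query))
        if not terms:
            return 0.0
        seen = sum(1 for t in terms if t in self.term_counts)
        return seen / len(terms)
```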

Confidence & Quality Assessor

Calculates various confidence metrics (consistency, saturation, coverage) to assess the quality and completeness of the crawled data, crucial for adaptive adjustments.

Related Classes/Methods:
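The three signals named above (consistency, saturation, coverage) could be blended into a single confidence score like this. The weights and the saturation heuristic are illustrative choices, not the component's actual formula:

```python
def assess_confidence(coverage, consistency, saturation,
                      weights=(0.5, 0.3, 0.2)):
    """Weighted blend of the three signals, clamped to [0, 1]."""
    w_cov, w_con, w_sat = weights
    score = w_cov * coverage + w_con * consistency + w_sat * saturation
    return max(0.0, min(1.0, score))

def saturation_score(term_sets_history):
    """Saturation heuristic: 1 minus the share of vocabulary that was new
    in the latest batch. High saturation means recent pages add little."""
    if len(term_sets_history) < 2:
        return 0.0
    prev, curr = term_sets_history[-2], term_sets_history[-1]
    new_terms = len(curr - prev)
    return 1.0 - new_terms / max(len(curr), 1)
```

High saturation plus high coverage is what lets the orchestrator stop early instead of crawling to an arbitrary page limit.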

Link Ranking & Selection Engine

Prioritizes and selects links for further exploration based on relevance, novelty, and identified coverage gaps, driving the adaptive nature of the crawl.

Related Classes/Methods:
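A toy version of relevance-plus-novelty link scoring, assuming links arrive as `(url, anchor_text)` pairs; the scoring formula and the 0.5 novelty weight are illustrative, not the engine's real logic:

```python
def rank_links(links, query_terms, seen_terms, k=5):
    """Rank (url, anchor_text) pairs by query relevance plus a novelty
    bonus for terms not yet covered; return the top-k URLs."""
    def score(anchor):
        terms = set(anchor.lower().split())
        relevance = len(terms & query_terms)   # overlap with the query
        novelty = len(terms - seen_terms)      # terms filling coverage gaps
        return relevance + 0.5 * novelty
    ranked = sorted(links, key=lambda link: score(link[1]), reverse=True)
    return [url for url, _ in ranked[:k]]
```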

Semantic Analysis Unit

Maps queries and documents into a semantic space using embeddings, facilitating advanced analysis like distance matrix computation for coverage and relevance, which is vital for AI/LLM integration.

Related Classes/Methods:
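To make the distance-matrix idea concrete, here is a self-contained sketch using a toy hashing embedding in place of a real embedding model (the actual unit would call out to a learned model; `embed` and `distance_matrix` are illustrative names):

```python
import math
from collections import Counter

def embed(text, dim=16):
    """Toy hashing embedding: bucket token counts into a fixed-size,
    L2-normalized vector. A stand-in for a real embedding model."""
    vec = [0.0] * dim
    for tok, n in Counter(text.lower().split()).items():
        vec[hash(tok) % dim] += n
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def distance_matrix(texts):
    """Pairwise cosine distances between embedded texts."""
    vecs = [embed(t) for t in texts]
    return [[1.0 - sum(a * b for a, b in zip(u, v)) for v in vecs]
            for u in vecs]
```

With real embeddings, low distance between a query vector and document vectors indicates semantic coverage, which feeds both ranking and confidence.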

Knowledge Base Persistence

Manages the saving, loading, importing, and exporting of crawl results and the accumulated knowledge base, ensuring the persistence of learned information for adaptive strategies.

Related Classes/Methods:
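Persistence of crawl results could be as simple as JSON Lines, one record per line, which supports cheap appending and streaming import/export. The schema below is assumed for illustration:

```python
import json
from pathlib import Path

def save_knowledge_base(docs, path):
    """Write one JSON object per line (JSONL)."""
    with open(path, "w", encoding="utf-8") as f:
        for doc in docs:
            f.write(json.dumps(doc) + "\n")

def load_knowledge_base(path):
    """Load a JSONL knowledge base; an absent file yields an empty base."""
    if not Path(path).exists():
        return []
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```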

Reporting & Statistics Generator

Provides insights into the crawl progress, coverage, and overall quality through various statistics and reports, aiding in monitoring and evaluating adaptive performance.

Related Classes/Methods:
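A report might flatten the crawl state into a small stats dictionary like the one below; the field names are assumptions for illustration:

```python
def crawl_report(state):
    """Summarize a crawl (given as a dict) into flat statistics."""
    docs = state.get("documents", [])
    lengths = [len(d) for d in docs]
    return {
        "pages_crawled": len(docs),
        "total_chars": sum(lengths),
        "avg_chars": sum(lengths) / len(lengths) if lengths else 0.0,
        "confidence": round(state.get("confidence", 0.0), 3),
    }
```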