Trace rising topics from signal to cross-platform evidence.
MindSpider identifies emerging public conversations, expands them into structured crawl queues, and captures platform-level discussion data analysts can inspect, store, and reuse.
Daily feeds across social, technical, and community surfaces.
Platform-specific passes for posts, comments, and engagement evidence.
Broad topic extraction first, platform-level sentiment crawling second.
System Shape
Discovery to crawl, without the manual gap
Discovery Sources
Daily signal intake
Weibo, Zhihu, Bilibili, Toutiao, GitHub, CoolApk, and adjacent feeds seed the topic graph before deeper crawling begins.
Agent Layer
AI topic extraction
Model-assisted summarization produces topic names, summaries, and keyword lists from noisy daily sources.
Crawl Queue
Keyword fan-out
The extracted topics become crawl tasks for each platform adapter, keeping downstream work tied to explicit evidence.
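The fan-out step can be sketched as a simple cross-product of platforms and keywords. This is an illustrative sketch, not the project's actual API: `CrawlTask`, `PLATFORMS`, and `fan_out` are hypothetical names, though the platform list matches the deep-crawl targets named on this page.

```python
from dataclasses import dataclass

# Deep-crawl targets named in the project docs.
PLATFORMS = ["xiaohongshu", "douyin", "kuaishou", "bilibili", "weibo", "tieba", "zhihu"]

@dataclass(frozen=True)
class CrawlTask:
    platform: str
    topic: str
    keyword: str

def fan_out(topic: str, keywords: list[str]) -> list[CrawlTask]:
    """Expand one extracted topic into per-platform crawl tasks,
    so every downstream crawl stays tied to the topic that triggered it."""
    return [
        CrawlTask(platform=p, topic=topic, keyword=kw)
        for p in PLATFORMS
        for kw in keywords
    ]

tasks = fan_out("new-energy vehicles", ["NEV price war", "battery recall"])
print(len(tasks))  # 7 platforms x 2 keywords = 14 tasks
```

Keeping the topic on every task is what makes the evidence traceable later: each crawled comment can be joined back to the signal that produced it.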
Platform Pass
Deep sentiment crawling
Xiaohongshu, Douyin, Kuaishou, Bilibili, Weibo, Tieba, and Zhihu are crawled with browser automation to capture comments, reactions, and discussion context.
Output
Tables + reports
Data lands in explicit tables like `daily_topics`, `topic_news_relation`, and `crawling_tasks`.
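A minimal sketch of what those tables might look like. The table names come from the project docs, but every column here is an assumption for illustration, and `sqlite3` stands in for the project's MySQL backend.

```python
import sqlite3

# Assumed columns for illustration only; the real schema lives in the repo.
DDL = """
CREATE TABLE daily_topics (
    id         INTEGER PRIMARY KEY,
    topic_name TEXT NOT NULL,
    summary    TEXT,
    keywords   TEXT,      -- e.g. a comma-separated keyword list
    created_at TEXT
);
CREATE TABLE topic_news_relation (
    topic_id INTEGER REFERENCES daily_topics(id),
    news_url TEXT NOT NULL
);
CREATE TABLE crawling_tasks (
    id       INTEGER PRIMARY KEY,
    topic_id INTEGER REFERENCES daily_topics(id),
    platform TEXT NOT NULL,
    keyword  TEXT NOT NULL,
    status   TEXT DEFAULT 'pending'   -- pending / running / done / failed
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)
```

The point of explicit tables over blob dumps: topics, their source news, and their crawl tasks stay joinable, so an analyst can query the full trajectory of a topic with ordinary SQL.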
Discovery sources
The broad pass is designed to recognize momentum before you choose a crawl target.
Deep-crawl targets
The second pass goes deeper on the platforms where sentiment, discussion, and feedback actually live.
Data outputs
What comes out is designed for operators, not just demos.
How It Works
A crawler pipeline shaped like an analyst workflow.
Discover rising topics
MindSpider pulls daily hot signals from news and community sources, then uses AI extraction to turn raw headlines into reusable topic clusters.
Fan out into platform crawls
Those topic clusters become structured keyword queues for deep crawls across Xiaohongshu, Douyin, Kuaishou, Bilibili, Weibo, Tieba, and Zhihu.
Persist evidence for analysis
Tasks, content, and relationships are written into MySQL-ready tables so you can review trajectories, compare platforms, and build downstream reports.
Architecture
Three lanes, one intent: make topic movement inspectable.
Daily signal intake
Broad Topic Extraction
The first lane watches public trend surfaces, normalizes source data, and asks the model layer to produce topics worth pursuing.
- Daily news and hot-list collection
- AI-generated topic summaries
- Keyword lists written to durable storage
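The extraction step in this lane can be sketched with a pluggable model callable. This is a hypothetical shape, not the project's DeepSeek integration: `extract_topics`, the prompt text, and the JSON layout are assumptions, and a canned fake model keeps the sketch runnable without API keys.

```python
import json

# Illustrative prompt; the real prompt and model wiring live in the repo.
PROMPT = (
    "Group these headlines into topics. Reply as JSON: "
    '[{"topic": ..., "summary": ..., "keywords": [...]}]\n\n'
)

def extract_topics(headlines, model):
    """model: callable(str) -> str that returns the JSON described above."""
    raw = model(PROMPT + "\n".join(headlines))
    return json.loads(raw)

def fake_model(prompt: str) -> str:
    # Stand-in for a real LLM call, returning one canned topic cluster.
    return json.dumps([{
        "topic": "smartphone launches",
        "summary": "Multiple vendors announced flagship phones.",
        "keywords": ["flagship", "launch event"],
    }])

topics = extract_topics(
    ["Vendor A launches flagship", "Vendor B schedules launch event"],
    fake_model,
)
print(topics[0]["topic"])  # smartphone launches
```

Whatever the model layer, the contract is the same: noisy headlines in, durable topic names, summaries, and keyword lists out.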
Platform-specific evidence collection
Deep Sentiment Crawling
The second lane takes the extracted keywords and turns them into structured crawl tasks for each target platform.
- Per-platform crawler adapters
- Login-aware browser sessions
- Comment, post, and interaction capture
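One plausible shape for per-platform adapters is a small registry keyed by platform name. This sketch is illustrative, not the project's code: the real adapters drive login-aware Playwright browser sessions, which this pure-Python stub omits.

```python
# Hypothetical adapter registry; ADAPTERS and the decorator are assumptions.
ADAPTERS = {}

def adapter(platform: str):
    """Register a crawl function under a platform name."""
    def wrap(fn):
        ADAPTERS[platform] = fn
        return fn
    return wrap

@adapter("weibo")
def crawl_weibo(keyword: str) -> dict:
    # Real code would capture posts, comments, and reactions
    # through a logged-in browser session here.
    return {"platform": "weibo", "keyword": keyword, "posts": []}

@adapter("zhihu")
def crawl_zhihu(keyword: str) -> dict:
    return {"platform": "zhihu", "keyword": keyword, "posts": []}

def run(platform: str, keyword: str) -> dict:
    """Dispatch a keyword crawl to the registered platform adapter."""
    return ADAPTERS[platform](keyword)

print(run("weibo", "battery recall")["platform"])  # weibo
```

A registry like this keeps the crawl queue generic: the fan-out stage only needs a platform name and a keyword, and the adapter owns the platform-specific session and capture logic.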
Tables, tasks, and replayability
Structured Output Layer
Instead of dumping text into blobs, MindSpider stores topic relations, crawl progress, and platform outputs in explicit database structures.
- MySQL-oriented persistence
- Task progress and status tracking
- Reusable datasets for reports and follow-on agents
Open-Source Status
MindSpider is live as a project identity, and its latest implementation path now runs through BettaFish.
The original MindSpider repository still documents the pipeline clearly. The maintainers now position the latest code inside BettaFish, so this site keeps the original project story and the current upstream path in the same frame.
- Use this site as the product-facing front door for the project story.
- Use the GitHub repository and README for the current setup path.
- Treat the /start route as a repository-first evaluation handoff, not a hosted signup flow.
Upstream repositories
Keep both links visible so operators can read the original README and follow the newer monorepo path without guessing.
Feature Surface
Product language for a system that still respects the code.
AI topic extraction
Convert noisy daily news and hot lists into themes, summaries, and keyword sets that agents can keep working with.
Playwright-first crawling
Browser automation is built into the deep crawl layer, making dynamic pages and login-heavy flows practical to operate.
Platform-aware storage
Outputs are mapped into structured tables for notes, videos, threads, tasks, and topic relationships instead of loose export files.
Keyword queue control
The system manages topic-to-keyword fan-out so follow-up crawls stay tied to the signals that triggered them.
Open-source inspectability
Everything is visible in code: pipeline stages, database schema, platform adapters, and the operational assumptions around them.
Built for analyst handoff
The output is meant to be reviewed, queried, and reused by humans or later agents instead of dying as a one-shot scrape.
Frequently Asked Questions
Is MindSpider a hosted SaaS product today?
No. This site presents MindSpider as an open-source project and product identity. The /start route is a setup handoff page that points you to the public repositories and README.
What does the two-stage pipeline actually mean?
Stage one identifies promising topics from daily feeds. Stage two takes the resulting keywords and runs deeper platform crawls to gather sentiment-bearing evidence.
Which technologies shape the implementation?
The README centers Python, Playwright, MySQL, asyncio, and a DeepSeek-compatible analysis layer for topic extraction and downstream interpretation.
Can I self-host it?
Yes. The project is presented as an inspectable open-source system, so the primary path today is repository-driven setup rather than a closed hosted dashboard.
Why mention BettaFish here?
Because the upstream README now states that the latest MindSpider code is maintained as a submodule inside BettaFish. Linking both avoids sending users to stale expectations.
Setup Path
Review the current setup path, then inspect the system at source level.
Today that means a repository-first handoff: the setup page, the public README, and the upstream repositories. The site is meant to clarify the system before you decide to run it.