About SITI
Sarawak Intelligent Tourism Insights (SITI) is a decision-support system built for the Ministry of Tourism, Creative Industry and Performing Arts Sarawak (MTCP). It scrapes public data sources, runs AI analysis, and produces ministerial-grade intelligence that goes beyond standard tourism board reports.
What SITI Does
SITI operates across five integrated modules, each designed to answer a specific question a minister or policy analyst would ask.
Module 1: Event Impact Intelligence
For each major Sarawak event, SITI surfaces what really happened beyond attendance numbers. It tracks social signal volume across Instagram, TikTok, and Facebook in the 30 days before, during, and after the event. It cross-references hotel price spikes, news coverage sentiment, and visitor reviews that mention the event by name. The post-event impact report estimates total spending, attendance, and first-time visitor percentage, and flags recommendations for the following year.
- Rainforest World Music Festival (RWMF)
- Borneo Cultural Festival
- Kuching Festival
- Gawai Dayak
- Hornbill Festival
- Sarawak International Dragon Boat Regatta
- What About Kuching (WAK)
- Sibu International Dance Festival
- Miri International Jazz Festival
- Pesta Babulang
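The before/during/after tracking described for this module can be sketched as a simple date classifier. This is a minimal illustration, assuming the "before" and "after" windows each span 30 days measured from the event's start and end dates; the function name `event_phase` and the `WINDOW` constant are hypothetical and not taken from SITI's code.

```python
from datetime import date, timedelta
from typing import Optional

# Assumed width of the pre-event and post-event windows.
WINDOW = timedelta(days=30)


def event_phase(post_date: date, start: date, end: date) -> Optional[str]:
    """Classify a post date as 'before', 'during', or 'after' an event.

    Returns None when the date falls outside all three windows.
    """
    if start <= post_date <= end:
        return "during"
    if start - WINDOW <= post_date < start:
        return "before"
    if end < post_date <= end + WINDOW:
        return "after"
    return None
```

Counting posts per phase then reduces to grouping scraped posts by the returned label and discarding those that map to `None`.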
Module 2: Heritage & Nature Site Health Index
A live 0-to-100 composite score for each heritage and nature site, updated weekly. The index combines three weighted components: sentiment from reviews over the last 90 days (40%), review volume versus the trailing 12-month baseline (30%), and sentiment direction over the last 90 versus prior 90 days (30%). Each site also receives an AI-generated weekly brief surfacing top complaints, top praises, reviewer origin breakdown, and seasonal patterns.
- Gunung Mulu National Park
- Niah National Park
- Bako National Park
- Semenggoh Wildlife Centre
- Annah Rais Longhouse
- Sarawak Cultural Village
- Fort Margherita
- Astana
- Borneo Cultures Museum
- Kuching Waterfront
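The 40/30/30 weighting of the index can be sketched as follows. The weights come from the description above; everything else is an assumption for illustration, in particular that each component has already been normalised to the range 0 to 1 before weighting (the normalisation SITI actually uses is not described here). The names `SiteComponents` and `health_index` are hypothetical.

```python
from dataclasses import dataclass

# Weights as stated in the module description.
WEIGHTS = {"sentiment": 0.40, "volume": 0.30, "trend": 0.30}


@dataclass(frozen=True)
class SiteComponents:
    # All three fields are assumed to be pre-normalised to 0..1.
    sentiment: float  # review sentiment over the last 90 days
    volume: float     # review volume versus the trailing 12-month baseline
    trend: float      # sentiment direction, last 90 versus prior 90 days


def _clamp(x: float) -> float:
    """Keep a component inside 0..1 so the composite stays within 0..100."""
    return max(0.0, min(1.0, x))


def health_index(c: SiteComponents) -> float:
    """Return the weighted composite score on a 0-100 scale."""
    score = (
        WEIGHTS["sentiment"] * _clamp(c.sentiment)
        + WEIGHTS["volume"] * _clamp(c.volume)
        + WEIGHTS["trend"] * _clamp(c.trend)
    )
    return round(100 * score, 1)
```

For example, a site with components 0.8, 0.5, and 0.5 scores 0.4 × 0.8 + 0.3 × 0.5 + 0.3 × 0.5 = 0.62, which gives an index of 62.0.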
Module 3: Spending Trends (Stubbed)
A forward-looking module that demonstrates what SITI will surface once DOSM Tourism Satellite Account data integration is approved. It is currently powered by publicly available headline figures with plausible divisional breakdowns. Every chart carries a clear data-pending badge. The UI is fully functional, so only the data source needs to be swapped when real data lands.
- Tourist spending by quarter (state-level)
- Spending by category: accommodation, F&B, retail, transport, attractions
- Top source markets by spend
- Per-visitor average spend trend
- Division-level heat map of spending intensity
Module 4: Environmental Impact Correlation (Stubbed)
A concept demonstration of how environmental factors correlate with tourism patterns. It uses real weather data from Open-Meteo (a free API that requires no scraping), correlated against scraped visitor review volume. Surface-level air quality and forest fire incident data are also correlated against event attendance and heritage site sentiment; these views are marked as pending NRECC integration.
Module 5: AI Briefing Studio
The flagship feature. A minister or analyst picks a scope (an event, an attraction, a date range, or a custom question), hits generate, and in 15 to 25 seconds receives a polished one-page executive brief with an executive summary, key findings backed by data references, actionable recommendations, and confidence notes that distinguish solid evidence from inference. Every numeric claim carries an inline citation traceable to the source data. Briefings are exportable and every generation run is logged for audit.
Data Sources
SITI does not receive any data from the Ministry. All data is collected from publicly accessible sources. No authenticated APIs, no credentials, no privileged access.
| Source | Data collected | Method |
|---|---|---|
| Google Maps | Reviews, ratings, reviewer origin, review dates | Playwright browser |
| TripAdvisor | Reviews, ratings, reviewer profiles, travel dates | Playwright browser |
| Booking.com | Hotel names, prices, star ratings, availability, check-in dates | Playwright browser |
| Agoda | Hotel names, prices, star ratings, availability, check-in dates | Playwright browser |
| Instagram | Hashtag post counts, estimated reach per hashtag | HTTP request |
| TikTok | Hashtag post counts, estimated reach per hashtag | HTTP request |
| Facebook | Public event listings, hashtag volume | HTTP request |
| Borneo Post | News articles (full text), publication dates, categories | BeautifulSoup |
| Dayak Daily | News articles (full text), publication dates, categories | BeautifulSoup |
| The Star Sarawak | News articles (full text), publication dates, categories | BeautifulSoup |
| STB Website | Official tourism statistics, event calendars | BeautifulSoup |
| Open-Meteo | Weather: rainfall, temperature, air quality index | Free API (no key) |
| DOSM | Tourism Satellite Account headline figures (public) | Manual anchor data |
Collection Schedule
All times are Malaysia/Kuching (MYT, UTC+8). The scraper runs as a daemon with an in-process scheduler (APScheduler). Every run writes a log row recording status, duration, item count, and any errors.
| Source | Frequency | Time |
|---|---|---|
| Open-Meteo weather | Daily | 01:00 |
| STB website | Weekly | Mon 02:00 |
| Google Maps reviews | Weekly per attraction | Tue–Sat 02:00–05:00 (rolling) |
| TripAdvisor reviews | Weekly per attraction | Tue–Sat 02:00–05:00 (rolling) |
| Booking.com / Agoda price scans | Daily | 03:00 |
| Borneo Post / Dayak Daily / The Star | Daily | 04:00 |
| Instagram hashtag volume | Daily | 06:00 |
| TikTok hashtag volume | Daily | 06:30 |
| Facebook public events | Weekly | Wed 02:00 |
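The per-run log row mentioned above (status, duration, item count, errors) could be captured by a small wrapper around each scheduled job. This is a sketch only: the `RunLog` fields, the `"OK"`/`"FAILED"` status labels, and the `run_logged` helper are hypothetical, and the record is returned in memory here rather than written to PostgreSQL.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RunLog:
    source: str
    status: str                  # assumed labels: "OK" or "FAILED"
    duration_s: float
    item_count: int
    error: Optional[str] = None  # populated only when the job raised


def run_logged(source: str, job: Callable[[], int]) -> RunLog:
    """Run a scrape job that returns its item count and record the outcome."""
    started = time.monotonic()
    try:
        count = job()
        return RunLog(source, "OK", time.monotonic() - started, count)
    except Exception as exc:
        # Record the failure instead of letting it crash the scheduler process.
        return RunLog(source, "FAILED", time.monotonic() - started, 0, repr(exc))
```

Wrapping every scheduled job this way means a failing source still produces a row, which is what makes consecutive-failure alerting possible.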
AI Pipeline Schedule
| Pipeline | Frequency | Description |
|---|---|---|
| Sentiment scoring | Per review, async | Each scraped review is language-detected, translated if needed, sentiment-scored via DeepSeek, and embedded with BGE-M3. |
| Topic clustering | Weekly | Reviews per attraction clustered via HDBSCAN on embeddings; DeepSeek labels each cluster with a 2-4 word topic. |
| Health index computation | Weekly | Composite 0-100 score per attraction from sentiment, volume, and trend components. |
| Cross-pillar correlation | Nightly | Builds the Apache AGE graph: events linked to nearby attractions, hotel price spikes, news mentions, and social signals. |
| Forecasting | On-demand + weekly | Prophet models trained per metric (hotel demand, review volume) with 30/60/90-day horizons. |
| Briefing generation | On-demand | User-scoped; DeepSeek Reasoner drafts, citations resolved, linter gate enforced before persistence. |
How It Works
Three-Service Architecture
SITI is built as three independent services that share a single PostgreSQL database and Redis instance. This separation means each service can be restarted, scaled, or replaced independently.
siti-scraper
Python 3.12, Crawlee, Playwright, APScheduler
All web scraping. Reads public pages, extracts structured data, writes raw and cleaned records to PostgreSQL. Runs on a cron-style schedule independent of the web app.
siti-intel
Python 3.12, FastAPI, pandas, Prophet, DeepSeek
AI analysis engine. Runs sentiment scoring, topic clustering, health index computation, cross-pillar correlation via Apache AGE graph database, Prophet-based forecasting, and briefing generation via DeepSeek Reasoner with streaming SSE output.
siti-web
Next.js 15, TypeScript, Prisma, Tailwind CSS
User-facing frontend and API routes. Serves the dashboard, handles authentication, and proxies AI requests to siti-intel. All charts rendered with Recharts, maps with Mapbox GL JS.
Infrastructure
The database and Redis instance run on a Synology NAS (nas1) with PostgreSQL 16, pgvector for embedding search, and Apache AGE for graph-based cross-pillar correlation. The user-facing edge is a Hostinger VPS running Traefik for SSL termination and reverse proxy. The VPS connects to nas1 over an SSH reverse tunnel — no database ports are exposed to the internet.
Briefing Generation Pipeline
- User picks scope in the UI (event, attraction, date range, or custom)
- bundler.py assembles a context bundle: all relevant reviews, news articles, hotel signals, social signals, and forecasts for the scope
- DeepSeek Reasoner receives the bundle with a system prompt establishing it as a senior tourism intelligence analyst writing for the Sarawak Minister of Tourism
- The model streams response tokens back to the browser over SSE (Server-Sent Events) in real time
- citations.py resolves every inline citation token against the reference map; orphaned citations cause rejection
- linter.py runs six deterministic quality rules. Any violation marks the run as FAILED and prevents persistence
- The briefing is persisted transactionally with its data references and generation run metadata
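The citation-resolution step can be illustrated with a short sketch. It is not the contents of citations.py: the `[ref:ID]` token syntax, the function names, and the shape of the reference map are all assumptions made for the example.

```python
import re
from typing import Dict, List

# Hypothetical inline citation syntax, e.g. "[ref:rev-102]".
# SITI's real token format is not documented here.
CITATION = re.compile(r"\[ref:([A-Za-z0-9_-]+)\]")


def find_orphans(text: str, reference_map: Dict[str, str]) -> List[str]:
    """Return cited IDs missing from the reference map, in first-seen order."""
    orphans: List[str] = []
    for ref_id in CITATION.findall(text):
        if ref_id not in reference_map and ref_id not in orphans:
            orphans.append(ref_id)
    return orphans


def resolve_or_reject(text: str, reference_map: Dict[str, str]) -> str:
    """Return the text unchanged if every citation resolves, else raise."""
    orphans = find_orphans(text, reference_map)
    if orphans:
        raise ValueError("orphaned citations: " + ", ".join(orphans))
    return text
```

Raising on the first unresolved draft, rather than stripping the bad token, matches the stated behaviour that orphaned citations cause rejection.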
Technology Stack
Frontend (siti-web)
- Next.js 15 (App Router)
- TypeScript (strict mode)
- Prisma ORM
- Tailwind CSS
- shadcn/ui (Radix primitives)
- Recharts
- Mapbox GL JS
- NextAuth.js v5
- date-fns
- BullMQ + Redis
AI Engine (siti-intel)
- Python 3.12
- FastAPI
- SQLAlchemy 2.0 (async)
- pandas 2.x
- Prophet (Meta)
- DeepSeek Reasoner (briefings)
- DeepSeek Chat (bulk ops)
- sentence-transformers + BGE-M3
- HDBSCAN
- Pydantic v2
Scraper (siti-scraper)
- Python 3.12
- Crawlee for Python
- Playwright
- BeautifulSoup
- APScheduler
- SQLAlchemy 2.0 (async)
- Bright Data proxies
Infrastructure
- PostgreSQL 16 + pgvector + Apache AGE
- Redis 7
- Docker Compose (NAS)
- Traefik + Let's Encrypt (VPS)
- SSH reverse tunnel (VPS to NAS)
- Tailscale (NAS networking)
- Synology DS1525+ (31 GB RAM)
Data Collection Ethics
- Public data only. SITI only accesses publicly available web pages. No authenticated APIs, no credential harvesting, no privileged access.
- robots.txt respected. Sites with robots.txt restrictions (Google, Meta) are only accessed where permitted.
- No PII beyond public display. Reviewer names and profile information are only stored as they appear publicly on review platforms.
- Rate limited. Maximum 1 request per 3 seconds per domain by default, tuned per source to avoid server load.
- Residential proxies. Bright Data residential IP pool used for review sites to respect geographic rate limits. Direct connections used where permitted.
- User-agent transparency. Crawlee's built-in fingerprint rotation identifies the scraper honestly; no impersonation of real browsers.
- No silent data loss. Partial scrapes are flagged as incomplete. Three consecutive failures on any source trigger an alert. Nothing is silently dropped.
- Audit trail. Every scrape run and every AI generation is logged with status, duration, and item count for full traceability.
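The default rate limit in the list above (at most one request per 3 seconds per domain) can be sketched as a per-domain interval check. This is an illustrative implementation, not SITI's scraper code: the class name `DomainRateLimiter` is hypothetical, and the injectable `clock` and `sleep` parameters exist only to make the behaviour easy to test.

```python
import time
from typing import Callable, Dict
from urllib.parse import urlparse


class DomainRateLimiter:
    """Enforce a minimum interval between requests to the same domain."""

    def __init__(
        self,
        min_interval: float = 3.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Dict[str, float] = {}  # domain -> time of latest request

    def wait(self, url: str) -> float:
        """Block until the URL's domain may be requested; return seconds waited."""
        domain = urlparse(url).netloc
        now = self._clock()
        last = self._last.get(domain)
        delay = 0.0 if last is None else max(0.0, self.min_interval - (now - last))
        if delay:
            self._sleep(delay)
        # Record when this request is actually allowed to go out.
        self._last[domain] = now + delay
        return delay
```

Calling `limiter.wait(url)` before each fetch keeps requests to one domain spaced out while leaving requests to other domains unaffected; per-source tuning would amount to one limiter instance per source with its own `min_interval`.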
Project Context
SITI is a demonstration system built by Aidju Digital to win a paid engagement with MTCP Sarawak. The Ministry does not provide data, so SITI scrapes public sources, runs AI analysis, and produces decision-grade intelligence the Ministry cannot get from standard Sarawak Tourism Board reports.
This is a pre-engagement demo. It is not a finished product. The demo's job is to demonstrate capability, not to ship at production scale. All modules render with real data or clearly marked stubs. Paid engagement scope (multi-tenant, Bahasa Melayu, real-time dashboards, public API, neural forecasting) is documented and awaits ministerial sign-off.
Last updated: 24 May 2026. Status: Demo build, feature-complete.