SITI

About SITI

Sarawak Intelligent Tourism Insights is a decision-support system built for the Ministry of Tourism, Creative Industry and Performing Arts Sarawak (MTCP). It scrapes public data sources, runs AI analysis, and produces ministerial-grade intelligence that goes beyond standard tourism board reports.

What SITI Does

SITI operates across five integrated modules, each designed to answer a specific question a minister or policy analyst would ask.

Module 1: Event Impact Intelligence

For each major Sarawak event, SITI surfaces what really happened, beyond attendance numbers. It tracks social signal volume across Instagram, TikTok, and Facebook in the 30 days before, during, and after the event. It cross-references hotel price spikes, news coverage sentiment, and visitor reviews that mention the event by name. The post-event impact report estimates total spending, attendance, and first-time visitor percentage, and flags recommendations for the following year.

  • Rainforest World Music Festival (RWMF)
  • Borneo Cultural Festival
  • Kuching Festival
  • Gawai Dayak
  • Hornbill Festival
  • Sarawak International Dragon Boat Regatta
  • What About Kuching (WAK)
  • Sibu International Dance Festival
  • Miri International Jazz Festival
  • Pesta Babulang
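
The before/during/after tracking described for this module can be illustrated with a minimal sketch. It assumes the window means 30 days on each side of the event and that social signal volume is available as a count per day; the function names and the dates in the usage note are hypothetical, not taken from SITI's code.

```python
from datetime import date, timedelta


def classify_day(day, event_start, event_end, window_days=30):
    """Return 'before', 'during', 'after', or None for a day relative to an event."""
    if event_start <= day <= event_end:
        return "during"
    if event_start - timedelta(days=window_days) <= day < event_start:
        return "before"
    if event_end < day <= event_end + timedelta(days=window_days):
        return "after"
    return None  # outside the tracked window


def window_totals(daily_counts, event_start, event_end, window_days=30):
    """Sum per-day signal counts into before/during/after buckets."""
    totals = {"before": 0, "during": 0, "after": 0}
    for day, count in daily_counts.items():
        phase = classify_day(day, event_start, event_end, window_days)
        if phase is not None:
            totals[phase] += count
    return totals
```

For an illustrative event running 20 to 22 June 2025, a post on 21 May falls in the "before" bucket, while a post on 20 May is ignored because it is 31 days ahead of the start.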

Module 2: Heritage & Nature Site Health Index

A 0-to-100 composite score for each heritage and nature site, recomputed weekly. The index combines three weighted components: sentiment from reviews over the last 90 days (40%), review volume versus the trailing 12-month baseline (30%), and sentiment trend, comparing the last 90 days with the prior 90 days (30%). Each site also receives an AI-generated weekly brief surfacing top complaints, top praises, reviewer origin breakdown, and seasonal patterns.

  • Gunung Mulu National Park
  • Niah National Park
  • Bako National Park
  • Semenggoh Wildlife Centre
  • Annah Rais Longhouse
  • Sarawak Cultural Village
  • Fort Margherita
  • Astana
  • Borneo Cultures Museum
  • Kuching Waterfront
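
The 40/30/30 weighting of the Health Index can be sketched as follows. Only the weights come from the description above; how each component is rescaled to 0-100 is an assumption made for illustration (mean sentiment in [-1, 1], volume ratio capped at twice the baseline, trend as the change in mean sentiment), and `health_index` is a hypothetical name.

```python
WEIGHTS = {"sentiment": 0.40, "volume": 0.30, "trend": 0.30}


def clamp(value, low, high):
    return max(low, min(high, value))


def health_index(sentiment_90d, prior_sentiment_90d, reviews_90d, reviews_12m):
    """Composite 0-100 score from sentiment, volume, and trend components."""
    # Sentiment: assumed mean score in [-1, 1], rescaled linearly to 0-100.
    sentiment_score = (clamp(sentiment_90d, -1.0, 1.0) + 1.0) * 50.0

    # Volume: 90-day review count against a quarter of the trailing 12-month
    # count (assumed baseline). A ratio of 1.0 maps to 50; 2.0 or more to 100.
    expected = reviews_12m / 4.0
    ratio = reviews_90d / expected if expected > 0 else 1.0
    volume_score = clamp(ratio, 0.0, 2.0) * 50.0

    # Trend: change in mean sentiment, last 90 days minus prior 90 days.
    delta = sentiment_90d - prior_sentiment_90d
    trend_score = (clamp(delta, -1.0, 1.0) + 1.0) * 50.0

    total = (
        WEIGHTS["sentiment"] * sentiment_score
        + WEIGHTS["volume"] * volume_score
        + WEIGHTS["trend"] * trend_score
    )
    return round(total, 1)
```

Under these assumptions, a site with steady sentiment of 0.5 and review volume exactly on baseline, `health_index(0.5, 0.5, 100, 400)`, scores 60.0.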

Module 3: Spending Trends (Stubbed)

A forward-looking module that demonstrates what SITI will surface once DOSM Tourism Satellite Account data integration is approved. It is currently powered by publicly available headline figures with plausible divisional breakdowns. Every chart carries a clear data-pending badge. The UI is fully functional; only the data source needs to be swapped when real data lands.

  • Tourist spending by quarter (state-level)
  • Spending by category: accommodation, F&B, retail, transport, attractions
  • Top source markets by spend
  • Per-visitor average spend trend
  • Division-level heat map of spending intensity

Module 4: Environmental Impact Correlation (Stubbed)

A concept demonstration of how environmental factors correlate with tourism patterns. It uses real weather data from Open-Meteo (a free API, so no scraping is required), correlated against scraped visitor review volume. Surface-level air quality and forest fire incident data are correlated against event attendance and heritage site sentiment; these views are marked as pending NRECC integration.

Module 5: AI Briefing Studio

The flagship feature. A minister or analyst picks a scope (an event, an attraction, a date range, or a custom question), hits generate, and in 15 to 25 seconds receives a polished one-page executive brief with an executive summary, key findings backed by data references, actionable recommendations, and confidence notes that distinguish solid evidence from inference. Every numeric claim carries an inline citation traceable to the source data. Briefings are exportable and every generation run is logged for audit.

Data Sources

SITI does not receive any data from the Ministry. All data is collected from publicly accessible sources. No authenticated APIs, no credentials, no privileged access.

| Source | Data collected | Method |
|---|---|---|
| Google Maps | Reviews, ratings, reviewer origin, review dates | Playwright browser |
| TripAdvisor | Reviews, ratings, reviewer profiles, travel dates | Playwright browser |
| Booking.com | Hotel names, prices, star ratings, availability, check-in dates | Playwright browser |
| Agoda | Hotel names, prices, star ratings, availability, check-in dates | Playwright browser |
| Instagram | Hashtag post counts, estimated reach per hashtag | HTTP request |
| TikTok | Hashtag post counts, estimated reach per hashtag | HTTP request |
| Facebook | Public event listings, hashtag volume | HTTP request |
| Borneo Post | News articles (full text), publication dates, categories | BeautifulSoup |
| Dayak Daily | News articles (full text), publication dates, categories | BeautifulSoup |
| The Star Sarawak | News articles (full text), publication dates, categories | BeautifulSoup |
| STB Website | Official tourism statistics, event calendars | BeautifulSoup |
| Open-Meteo | Weather: rainfall, temperature, air quality index | Free API (no key) |
| DOSM | Tourism Satellite Account headline figures (public) | Manual anchor data |

Collection Schedule

All times are Malaysia time (MYT, UTC+8; IANA time zone Asia/Kuching). The scraper runs as a daemon with an in-process scheduler (APScheduler). Every run writes a log row recording status, duration, item count, and any errors.

| Source | Frequency | Time |
|---|---|---|
| Open-Meteo weather | Daily | 01:00 |
| STB website | Weekly | Mon 02:00 |
| Google Maps reviews | Weekly per attraction | Tue–Sat 02:00–05:00 (rolling) |
| TripAdvisor reviews | Weekly per attraction | Tue–Sat 02:00–05:00 (rolling) |
| Booking.com / Agoda price scans | Daily | 03:00 |
| Borneo Post / Dayak Daily / The Star | Daily | 04:00 |
| Instagram hashtag volume | Daily | 06:00 |
| TikTok hashtag volume | Daily | 06:30 |
| Facebook public events | Weekly | Wed 02:00 |
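
The per-run log row mentioned at the start of this section (status, duration, item count, errors) could be produced by a small wrapper around each scheduled job. This is a sketch only: `RunLog`, `run_with_log`, and the status strings are hypothetical, and writing the row to PostgreSQL is omitted.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RunLog:
    source: str
    status: str          # "ok" or "failed" (assumed labels)
    duration_s: float
    item_count: int
    error: Optional[str] = None


def run_with_log(source: str, scrape: Callable[[], list]) -> RunLog:
    """Run one scrape job and capture its outcome as a log record."""
    started = time.monotonic()
    try:
        items = scrape()
    except Exception as exc:
        # Record the failure instead of letting it vanish with the job.
        return RunLog(source, "failed", time.monotonic() - started, 0, repr(exc))
    return RunLog(source, "ok", time.monotonic() - started, len(items))
```

A scheduler job would then call something like `run_with_log("open-meteo", fetch_weather)` and persist the returned record, where `fetch_weather` stands in for the real scrape function.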

AI Pipeline Schedule

| Pipeline | Frequency | Description |
|---|---|---|
| Sentiment scoring | Per review, async | Each scraped review is language-detected, translated if needed, sentiment-scored via DeepSeek, and embedded with BGE-M3. |
| Topic clustering | Weekly | Reviews per attraction clustered via HDBSCAN on embeddings; DeepSeek labels each cluster with a 2-4 word topic. |
| Health index computation | Weekly | Composite 0-100 score per attraction from sentiment, volume, and trend components. |
| Cross-pillar correlation | Nightly | Builds the Apache AGE graph: events linked to nearby attractions, hotel price spikes, news mentions, and social signals. |
| Forecasting | On-demand + weekly | Prophet models trained per metric (hotel demand, review volume) with 30/60/90-day horizons. |
| Briefing generation | On-demand | User-scoped; DeepSeek Reasoner drafts, citations resolved, linter gate enforced before persistence. |

How It Works

Three-Service Architecture

SITI is built as three independent services that share a single PostgreSQL database and Redis instance. This separation means each service can be restarted, scaled, or replaced independently.

siti-scraper

Python 3.12, Crawlee, Playwright, APScheduler

All web scraping. Reads public pages, extracts structured data, writes raw and cleaned records to PostgreSQL. Runs on a cron-style schedule independent of the web app.

siti-intel

Python 3.12, FastAPI, pandas, Prophet, DeepSeek

AI analysis engine. Runs sentiment scoring, topic clustering, health index computation, cross-pillar correlation via the Apache AGE graph extension for PostgreSQL, Prophet-based forecasting, and briefing generation via DeepSeek Reasoner with streaming SSE output.

siti-web

Next.js 15, TypeScript, Prisma, Tailwind CSS

User-facing frontend and API routes. Serves the dashboard, handles authentication, and proxies AI requests to siti-intel. All charts rendered with Recharts, maps with Mapbox GL JS.

Infrastructure

The database and Redis instance run on a Synology NAS (nas1) with PostgreSQL 16, pgvector for embedding search, and Apache AGE for graph-based cross-pillar correlation. The user-facing edge is a Hostinger VPS running Traefik for SSL termination and reverse proxy. The VPS connects to nas1 over an SSH reverse tunnel — no database ports are exposed to the internet.

Briefing Generation Pipeline

  1. User picks scope in the UI (event, attraction, date range, or custom)
  2. bundler.py assembles a context bundle: all relevant reviews, news articles, hotel signals, social signals, and forecasts for the scope
  3. DeepSeek Reasoner receives the bundle with a system prompt establishing it as a senior tourism intelligence analyst writing for the Sarawak Minister of Tourism
  4. The model streams response tokens back to the browser over SSE (Server-Sent Events) in real time
  5. citations.py resolves every inline citation token against the reference map; orphaned citations cause rejection
  6. linter.py runs six deterministic quality rules. Any violation marks the run as FAILED and prevents persistence
  7. The briefing is persisted transactionally with its data references and generation run metadata
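
Step 5 above (rejecting orphaned citations) can be illustrated with a minimal sketch. The `[ref:ID]` token syntax, the function name, and the shape of the reference map are all assumptions for illustration; the actual citations.py may differ.

```python
import re

# Hypothetical inline citation syntax: [ref:some-id]
CITATION_PATTERN = re.compile(r"\[ref:([A-Za-z0-9_-]+)\]")


def resolve_citations(text, reference_map):
    """Return the references cited in text, in first-seen order.

    Raises ValueError if any citation token has no entry in the
    reference map (an orphaned citation), so the draft is rejected.
    """
    cited = CITATION_PATTERN.findall(text)
    orphans = sorted({ref_id for ref_id in cited if ref_id not in reference_map})
    if orphans:
        raise ValueError(f"orphaned citations: {', '.join(orphans)}")
    unique_ids = dict.fromkeys(cited)  # de-duplicate, keep order
    return [reference_map[ref_id] for ref_id in unique_ids]
```

Failing loudly here, before the linter and persistence steps, matches the described behaviour that a briefing with an untraceable numeric claim never reaches the database.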

Technology Stack

Frontend (siti-web)

  • Next.js 15 (App Router)
  • TypeScript (strict mode)
  • Prisma ORM
  • Tailwind CSS
  • shadcn/ui (Radix primitives)
  • Recharts
  • Mapbox GL JS
  • NextAuth.js v5
  • date-fns
  • BullMQ + Redis

AI Engine (siti-intel)

  • Python 3.12
  • FastAPI
  • SQLAlchemy 2.0 (async)
  • pandas 2.x
  • Prophet (Meta)
  • DeepSeek Reasoner (briefings)
  • DeepSeek Chat (bulk ops)
  • sentence-transformers + BGE-M3
  • HDBSCAN
  • Pydantic v2

Scraper (siti-scraper)

  • Python 3.12
  • Crawlee for Python
  • Playwright
  • BeautifulSoup
  • APScheduler
  • SQLAlchemy 2.0 (async)
  • Bright Data proxies

Infrastructure

  • PostgreSQL 16 + pgvector + Apache AGE
  • Redis 7
  • Docker Compose (NAS)
  • Traefik + Let's Encrypt (VPS)
  • SSH reverse tunnel (VPS to NAS)
  • Tailscale (NAS networking)
  • Synology DS1525+ (31 GB RAM)

Data Collection Ethics

  • Public data only. SITI only accesses publicly available web pages. No authenticated APIs, no credential harvesting, no privileged access.
  • robots.txt respected. Sites with robots.txt restrictions (Google, Meta) are only accessed where permitted.
  • No PII beyond public display. Reviewer names and profile information are only stored as they appear publicly on review platforms.
  • Rate limited. Maximum 1 request per 3 seconds per domain by default, tuned per source to avoid server load.
  • Residential proxies. Bright Data residential IP pool used for review sites to respect geographic rate limits. Direct connections used where permitted.
  • User-agent transparency. Crawlee's built-in fingerprint rotation identifies the scraper honestly; no impersonation of real browsers.
  • No silent data loss. Partial scrapes are flagged as incomplete. Three consecutive failures on any source trigger an alert. Nothing is silently dropped.
  • Audit trail. Every scrape run and every AI generation is logged with status, duration, and item count for full traceability.
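
The default rate limit listed above (at most 1 request per 3 seconds per domain) could be enforced with a small per-domain limiter like the sketch below. The class name and the injectable `clock` and `sleep` parameters are hypothetical and exist mainly to make the behaviour easy to test; per-source tuning is not shown.

```python
import time
from urllib.parse import urlparse


class DomainRateLimiter:
    """Enforce a minimum interval between requests to the same domain."""

    def __init__(self, min_interval_s=3.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self._clock = clock
        self._sleep = sleep
        self._last_request = {}  # domain -> time of the last permitted request

    def wait(self, url):
        """Block until a request to url's domain is allowed; return the delay."""
        domain = urlparse(url).netloc
        now = self._clock()
        last = self._last_request.get(domain)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_interval_s - (now - last))
        if delay > 0:
            self._sleep(delay)
        self._last_request[domain] = now + delay
        return delay
```

A scraper would call `limiter.wait(url)` immediately before each request. Requests to different domains do not delay each other.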

Project Context

SITI is a demonstration system built by Aidju Digital to win a paid engagement with MTCP Sarawak. The Ministry does not provide data, so SITI scrapes public sources, runs AI analysis, and produces decision-grade intelligence the Ministry cannot get from standard Sarawak Tourism Board reports.

This is a pre-engagement demo. It is not a finished product. The demo's job is to demonstrate capability, not to ship at production scale. All modules render with real data or clearly marked stubs. Paid engagement scope (multi-tenant, Bahasa Melayu, real-time dashboards, public API, neural forecasting) is documented and awaits ministerial sign-off.

Last updated: 24 May 2026. Status: Demo build, feature-complete.