Who should read this
Summary: SaaS is not “building software.” It’s running a service. Unlike packaged software you sell once and forget, SaaS means keeping contracts alive as subscriptions, keeping infrastructure up 24/7, keeping tenant data isolated, and shipping continuously. This guide walks the full lifecycle of building and running a SaaS from zero, organized into nine parts: eight stages (ideation, design, development, infrastructure, launch, operations, growth, and organization) followed by a stage-by-stage checklist.
This piece is for zero-to-one founders, CTOs and lead engineers scaling from 1-to-10, and established software teams evaluating a SaaS transition. Rather than covering every topic in depth, it focuses on the decisions that must be made at each stage and how they cascade into later stages.
0. Opening — what SaaS actually is
Three things are at the core of SaaS.
First, multi-tenancy. Many customers share one system, but their data must be perfectly isolated from one another. How this is designed decides scalability, cost, and security all at once.
Second, a subscription revenue model. Money doesn’t come in once; it recurs monthly or annually. Keeping customers from leaving (retention) matters as much as acquiring new ones. Metrics like MRR, churn, and LTV become the vital signs of the company.
Third, continuous delivery. Customers are always on the latest version. Versioning, compatibility, rollback, and data migration are daily concerns.
1. Ideation — what to sell, to whom, and why
1.1 Problem discovery and market validation
A good SaaS always starts from a problem that is recurring, tied to money, and currently painful to solve. Before writing code, validate:
- Painful: Will customers actually pay, or is it a nice-to-have?
- Frequent: Does it happen often? A once-a-year problem won’t support a subscription.
- Urgent: Does it have to be solved now, or can it wait?
- Underserved: Are existing tools insufficient, expensive, or unpleasant to use?
Validation methods are old-fashioned but still work. At least 30 prospect interviews (fewer is too few to trust), scraping competitor reviews, harvesting complaints on Reddit/Slack/LinkedIn, and above all checking intent to pre-pay. “I’m interested” lies often. A card pulled out of a wallet tells the truth.
1.2 Defining your ICP (Ideal Customer Profile)
“A product for everyone” serves no one. Make the ICP concrete along at least these axes:
- Company size (headcount, revenue)
- Industry
- Role / title (the buyer and the user may be different people)
- Existing tool stack
- Budget authority structure
- Severity and frequency of the pain
For B2B, draw four separate personas: Buyer, User, Champion (internal advocate), and Gatekeeper (security, legal — the ones who can say no). These are rarely the same person.
1.3 Value proposition and positioning
April Dunford’s positioning frame is the most practically useful I’ve used:
- Competitive alternatives — what the customer uses instead (could be a competitor, could be Excel, could be manual work)
- Unique attributes — what you have that they don’t
- Value — what those attributes actually deliver
- Target — the customer who wants that value the most
- Market category — the bucket you want to be sorted into
Skip this and your landing page headline, sales pitch, and ultimately your product direction all drift.
1.4 Business model and pricing
Main SaaS pricing models:
- Per-seat: Slack, Notion. Scales with users. Simple, but customers hoard logins.
- Usage-based: AWS, Twilio, OpenAI API. Scales well but invoices are hard to predict.
- Tiered: Free / Pro / Business / Enterprise. Most common and stable. The upgrade triggers between tiers have to be clear.
- Flat fee: Works for small tools, but caps expansion.
- Hybrid: Per-seat + usage, base + usage. In practice, the most common.
Pricing principles:
- Value-based pricing. Price against the value delivered, not cost. If you save a customer $10K/month, $500/month is a bargain.
- 3-tier psychology. With three plans, the middle one sells most (anchoring). Put the plan you really want to sell in the middle.
- Annual discount. 15–20% off annual pricing improves cash flow and retention at once.
- Enterprise = Contact Sales. Don’t publish the top-tier price. Leave room to negotiate, and capture the price-insensitive big accounts.
1.5 MVP scope and roadmap
MVP’s operative word is Viable, not Minimum. Something a customer can extract real value from and pay for.
Useful rules for scoping an MVP:
- Walking Skeleton. Build a thin end-to-end version from day one: sign up → one core feature → payment.
- Name your No-GO list. Writing down what you’re not building blocks scope creep.
- 6–12 weeks. Longer than that is no longer an MVP.
A Now / Next / Later roadmap works better in practice than a dated Gantt chart. Now is committed, Next is direction, Later is hypothesis.
2. Product design
2.1 Requirements analysis
Separate functional from non-functional requirements. In SaaS, non-functional requirements matter disproportionately.
Non-functional checklist:
- Availability: Target SLA. 99.9% ≈ 43 min of downtime per month, 99.95% ≈ 22 min, 99.99% ≈ 4 min.
- Performance: p50/p95/p99 latency targets.
- Scalability: Concurrent users, requests per second, data growth rate.
- Security: Authentication, encryption, access control.
- Compliance: GDPR, SOC 2, ISO 27001, HIPAA — whatever your target market requires.
- Observability: Logs, metrics, traces.
- Data residency: Does data have to live in a specific country?
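The availability figures in this checklist are simple arithmetic, and it helps to be able to recompute them for any target. A minimal sketch, assuming a 30-day month (the function name and the 30-day convention are illustrative choices, not a standard):

```python
# Downtime allowed by an availability target, assuming a 30-day month.
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes


def allowed_downtime_minutes(sla_percent: float,
                             period_minutes: int = MINUTES_PER_30_DAY_MONTH) -> float:
    """Return the minutes of downtime an availability target permits in the period."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    return period_minutes * (1 - sla_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.95, 99.99):
        print(f"{target}% -> {allowed_downtime_minutes(target):.1f} min/month")
```

Each extra "nine" cuts the budget by roughly a factor of ten, which is why the jump from 99.9% to 99.99% is usually an architecture change rather than a tuning exercise.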
2.2 UX research and design
Two axes for SaaS UX:
- First-run experience (onboarding): How fast does a new user hit the “Aha moment”? Shorter Time-to-value (TTV) means higher conversion.
- Daily usage: How fast and efficient is the main daily flow?
States you’ll design, whether you plan for them or not:
- Empty state (no data yet)
- Loading state (skeletons)
- Error state (messages and recovery paths)
- Permissions state (when a user lacks access)
- Paywall / upgrade state (when a plan limit is hit)
Adopt a design system early. Tailwind + shadcn/ui, Radix, Material — pick one. Unifying styles after they’ve spread is nearly a rebuild.
2.3 System architecture
Monolith vs microservices. Early on, almost always the monolith. Shopify, GitHub, and Basecamp were all built as monoliths and, in many cases, still are. Split only once you pass ~20 people and have domains with genuinely different release cadences. “Majestic Monolith” exists as a phrase for a reason. Before deciding, it’s worth walking through the modular monolith vs microservices trade-off once.
Sync vs async. Payments, email, report generation, external API calls, and large batches all go through worker queues. Sidekiq, Celery, BullMQ, AWS SQS + Lambda, RabbitMQ: pick one and stick with it. The pattern is “user request → immediate 202 Accepted → background processing → notification on completion.”
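The request-then-background pattern can be sketched with nothing but the standard library. This is an illustration only: a real deployment would use one of the queue systems named above, and `handle_request`, `worker`, and the in-memory `status` dict are hypothetical stand-ins for a web handler, a worker process, and a job table.

```python
import queue
import threading
import uuid

jobs: "queue.Queue[tuple[str, str]]" = queue.Queue()
status: dict[str, str] = {}  # job_id -> "queued" | "done" (stand-in for a job table)


def handle_request(report_name: str) -> dict:
    """Accept the request, enqueue the slow work, and return immediately."""
    job_id = str(uuid.uuid4())
    status[job_id] = "queued"
    jobs.put((job_id, report_name))
    # 202 Accepted: the work has been accepted but is not finished yet.
    return {"status_code": 202, "job_id": job_id}


def worker() -> None:
    """Background worker: pull jobs, do the slow work, record completion."""
    while True:
        job_id, report_name = jobs.get()
        # Placeholder for the real work (report generation, email, external API call).
        status[job_id] = "done"  # a real system would also notify the user here
        jobs.task_done()


threading.Thread(target=worker, daemon=True).start()
```

The caller gets a `job_id` back right away and can poll for status or wait for a notification, so a slow report never holds a web request open.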
2.4 Multi-tenancy strategy (the core SaaS decision)
Three models, and the choice rewrites everything downstream.
1. Shared DB, shared schema (fully shared).
- All tenants, same DB, same tables. Every row has a tenant_id.
- Pros: Best operational/cost efficiency. One schema change reaches everyone.
- Cons: Isolation depends entirely on application code. Forget WHERE tenant_id = ? once and you leak data.
- Fits: early-stage startups, SMB targets.
- Must-haves: PostgreSQL Row-Level Security (RLS), ORM-level forced scoping (Rails default_scope, Django Manager), periodic audit queries.
2. Shared DB, separate schema.
- One DB, one schema per tenant (using PostgreSQL schemas).
- Pros: Middle-ground isolation. Schema-level backup and restore.
- Cons: Becomes hard to manage past a few thousand tenants. Migrations get messy.
- Fits: Mid-market, mid-sized company targets.
3. Separate DB (full isolation).
- A DB instance or cluster per tenant.
- Pros: Maximum isolation. Performance isolation, data residency, dedicated handling for large customers.
- Cons: Cost and operational complexity explode. N migrations.
- Fits: Enterprise, healthcare, finance.
In practice, hybrid is the norm. Shared for SMB, dedicated DB/infrastructure for enterprise. To allow this, abstract the tenant-routing layer (which tenant lives on which DB) from the start. The detailed design of roles and permissions inside a tenant is covered separately in multi-tenant permissions architecture.
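The tenant-routing abstraction can start very small. A minimal sketch, assuming connection strings are the only thing that differs between placements; `TenantRouter`, `TenantPlacement`, and the DSNs are hypothetical names for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantPlacement:
    dsn: str          # connection string of the database that holds this tenant
    dedicated: bool   # True when the tenant has its own database


class TenantRouter:
    """Maps tenant IDs to databases; tenants without an override use the shared DB."""

    def __init__(self, shared_dsn: str) -> None:
        self._shared = TenantPlacement(dsn=shared_dsn, dedicated=False)
        self._overrides: dict[str, TenantPlacement] = {}

    def move_to_dedicated(self, tenant_id: str, dsn: str) -> None:
        """Record that a tenant now lives on its own database."""
        self._overrides[tenant_id] = TenantPlacement(dsn=dsn, dedicated=True)

    def placement_for(self, tenant_id: str) -> TenantPlacement:
        """Resolve where a tenant's data lives before opening a connection."""
        return self._overrides.get(tenant_id, self._shared)
```

If every data access resolves its connection through `placement_for`, moving an enterprise customer to dedicated infrastructure becomes a data migration plus one routing change, not an application rewrite.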
2.5 Data model principles
- Every table has tenant_id, created_at, updated_at, deleted_at (soft delete). No exceptions.
- UUID vs autoincrement ID. Externally exposed IDs should be UUIDs (UUIDv7 is index-friendly). Internal PKs as bigint are better for performance.
- Audit log: Who changed what when. Essential for enterprise sales.
- Soft delete: Needed to recover from customer mistakes and to meet regulatory retention requirements. GDPR erasure requests still need a hard-delete path.
- Split time-series/event tables. High-write tables belong outside transactional tables.
2.6 API design
- REST as default. GraphQL when the frontend is complex and has many clients. gRPC for internal service-to-service.
- Versioning: /v1/, /v2/ in URLs is clearest. Header versioning hurts caching and operations.
- Pagination: Cursor-based (?cursor=xxx) scales better than offset.
- Idempotency-Key: Critical POSTs (payments) must be safe to retry.
- Rate limiting: Default-on for every endpoint. Differentiated by plan.
- Webhooks: Give customers your events. Implement retries, HMAC signatures, replay protection.
- Official SDKs: At minimum JavaScript and Python up front. A TypeScript-typed SDK meaningfully raises developer-experience quality.
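Of the items above, webhook signing is the one most often implemented subtly wrong. A minimal sketch of HMAC signatures with replay protection, assuming the sender transmits a Unix timestamp alongside the signature; the five-minute tolerance and the `<timestamp>.<body>` message layout are illustrative choices, not a fixed standard:

```python
import hashlib
import hmac
import time
from typing import Optional

TOLERANCE_SECONDS = 300  # assumed replay window; tune to your needs


def sign_webhook(secret: bytes, body: bytes, timestamp: int) -> str:
    """Sign "<timestamp>.<body>" so the timestamp cannot be changed independently."""
    message = str(timestamp).encode() + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_webhook(secret: bytes, body: bytes, timestamp: int, signature: str,
                   now: Optional[float] = None) -> bool:
    """Reject stale timestamps (replay protection), then compare in constant time."""
    current = time.time() if now is None else now
    if abs(current - timestamp) > TOLERANCE_SECONDS:
        return False
    expected = sign_webhook(secret, body, timestamp)
    return hmac.compare_digest(expected, signature)
```

The receiver must verify against the raw request bytes, not a re-serialized JSON object, because any change in whitespace or key order changes the digest.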
2.7 Security and compliance design
Day-one basics:
- Encryption in transit: HTTPS / TLS 1.2+ only.
- Encryption at rest: DB and disk encryption. Sensitive fields (PII, payment, tokens) get an extra layer at the application level (KMS).
- Secrets management: No secrets hardcoded in code or committed env files. AWS Secrets Manager, HashiCorp Vault, Doppler.
- RBAC: Tenant roles (Owner, Admin, Member, Viewer). Enterprise customers will eventually demand ABAC or granular permissions.
- Audit logs: Logins, permission changes, data access, deletions — all recorded.
- Session management: Expiry, concurrent-session limits, suspicious-login alerts.
Compliance depends on your market:
- GDPR (EU): Consent, data export, right to be forgotten.
- SOC 2 Type II: Effectively required in B2B SaaS. Needs at least 6–12 months of operating history.
- ISO 27001: For global enterprise sales.
- HIPAA: Medical data.
- PCI DSS: If you handle card data directly (offload to Stripe and you can stay at SAQ-A).
Using Vanta, Drata, or Secureframe from the start is now standard for SOC 2. Doing it by hand costs engineer-months.
3. Development
3.1 Tech stack selection
The criterion is “can this team ship fast?” — not “is this fun?”
Common combinations:
- Full-stack productivity: Rails, Django, Laravel, Next.js + Node/Bun.
- Type safety: TypeScript + NestJS, Go, Kotlin.
- DB: PostgreSQL is almost always right. MySQL works. NoSQL only with a specific reason.
- Cache/queue: Redis.
- Search: PostgreSQL full-text early, Elasticsearch / OpenSearch / Meilisearch at scale.
- Frontend: React + TypeScript is the safest ecosystem bet. Vue, Svelte work too.
- Infra: AWS/GCP/Azure — whichever the team knows. Early on, PaaS like Vercel, Render, Fly.io, Railway is a strong choice.
3.2 Authentication (AuthN)
Build vs buy.
Building is reasonable for basic email/password plus social login. Services like Auth0, Clerk, WorkOS, Supabase Auth make sense when you need SSO (SAML, OIDC), SCIM (auto-provisioning), MFA, and enterprise features.
Enterprise customers will ask for SAML SSO and SCIM almost the moment they arrive. Building that yourself is an engineer-months project. That’s why WorkOS became the standard in this space.
Token strategy:
- JWT: Good for stateless APIs. Hard to revoke, so pair short lifetimes (15 min) with refresh tokens.
- Session cookies: Best for web apps. Server-side revocation is easy.
- Many SaaS ship a hybrid: session cookies for the web app, JWT or API keys for the API.
3.3 Billing and subscriptions
Stripe is effectively the standard. Delegate payment to Stripe but mirror subscription state and plan/quota in your own DB.
What you have to build:
- Plan definitions and pricing
- Checkout (Stripe Checkout or Elements)
- Webhook handling (invoice.paid, customer.subscription.updated, customer.subscription.deleted, etc.)
- Idempotency: webhooks can arrive multiple times
- Dunning for failed payments
- Proration on plan changes
- Refunds, credits
- Tax (Stripe Tax, Avalara)
- Invoicing, VAT, business number collection
- Trials, coupons, referral discounts
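Because webhooks can arrive more than once, the handler that mirrors subscription state should record each event ID and skip repeats. A minimal sketch using SQLite for self-containment; the flat event shape (`id`, `type`, `customer_id`) is a simplification for illustration and not Stripe's actual payload, and the table names are hypothetical:

```python
import sqlite3

billing_db = sqlite3.connect(":memory:")
billing_db.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
billing_db.execute(
    "CREATE TABLE subscriptions (customer_id TEXT PRIMARY KEY, status TEXT NOT NULL)"
)

# Simplified mapping from event type to mirrored status (an assumption for this sketch).
STATUS_BY_EVENT = {
    "invoice.paid": "active",
    "customer.subscription.deleted": "canceled",
}


def handle_event(event: dict) -> bool:
    """Apply a billing event once. Returns False if it was already processed."""
    with billing_db:  # one transaction: dedupe record and state change commit together
        try:
            billing_db.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)", (event["id"],)
            )
        except sqlite3.IntegrityError:
            return False  # duplicate delivery; safe to ignore
        new_status = STATUS_BY_EVENT.get(event["type"])
        if new_status is not None:
            billing_db.execute(
                "INSERT OR REPLACE INTO subscriptions (customer_id, status) VALUES (?, ?)",
                (event["customer_id"], new_status),
            )
    return True
```

Writing the dedupe row and the state change in the same transaction matters: if the handler crashes halfway, neither is committed and the retry is processed cleanly.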
3.4 Testing strategy
SaaS needs these layers:
- Unit tests: Business logic.
- Integration tests: DB and external-API-inclusive features.
- E2E tests: Playwright or Cypress for primary user flows.
- Contract tests: Pact, etc., to protect external API contracts.
- Performance tests: k6, Locust for load.
- Security tests: SAST, dependency vuln scanning (Dependabot, Snyk), DAST, periodic pentests.
Coverage numbers matter less than confidence in the critical flows. Protect payments, authorization, and tenant isolation with automated tests specifically.
3.5 Multi-tenancy isolation in practice
The single most common disaster is a missed tenant scope. Layered defense:
- Application level: Every query auto-scoped by tenant_id via ORM middleware/scope.
- DB level: PostgreSQL Row-Level Security (RLS) on. If the app code slips, the DB still catches it.
- Tests: “When logged in as Tenant A, accessing Tenant B’s resource ID returns 404/403” — for every resource, automated.
- Audit queries: Periodic “find rows where tenant_id is NULL or wrong.”
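The application-level scope, the cross-tenant test, and the audit query can be sketched together. This uses SQLite so it runs anywhere, which means it cannot show the RLS layer; the `projects` table and the function names are hypothetical:

```python
import sqlite3

app_db = sqlite3.connect(":memory:")
app_db.execute(
    "CREATE TABLE projects (id INTEGER PRIMARY KEY, tenant_id TEXT NOT NULL, name TEXT NOT NULL)"
)
app_db.executemany(
    "INSERT INTO projects (id, tenant_id, name) VALUES (?, ?, ?)",
    [(1, "tenant_a", "Alpha"), (2, "tenant_b", "Beta")],
)


class NotFound(Exception):
    """Raised for missing rows and for rows owned by another tenant alike."""


def get_project(current_tenant: str, project_id: int) -> dict:
    # tenant_id is part of every lookup, so a foreign ID behaves like a missing one.
    row = app_db.execute(
        "SELECT id, name FROM projects WHERE id = ? AND tenant_id = ?",
        (project_id, current_tenant),
    ).fetchone()
    if row is None:
        raise NotFound(project_id)
    return {"id": row[0], "name": row[1]}


def orphan_rows() -> int:
    """Audit query: rows with a missing or empty tenant_id (expected: 0)."""
    return app_db.execute(
        "SELECT COUNT(*) FROM projects WHERE tenant_id IS NULL OR tenant_id = ''"
    ).fetchone()[0]
```

Returning the same error for "does not exist" and "belongs to someone else" avoids confirming to an attacker that a given ID is valid in another tenant.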
4. Infrastructure
4.1 Cloud strategy
AWS, GCP, Azure — practical criteria:
- AWS: Widest service footprint and ecosystem. Initial setup is verbose.
- GCP: Strong in data/ML, cleaner UX. Enterprise perception still trails AWS.
- Azure: Best for enterprise / MS-centric customers.
A startup-lens comparison of the three is in AWS vs GCP vs Azure for startups; for frontend PaaS, see Vercel vs Netlify vs Cloudflare Pages.
Early-stage startups can happily start at the PaaS layer:
- Vercel (Next.js deploys)
- Render, Railway, Fly.io (general containers)
- Supabase, Neon (Postgres)
- Cloudflare (CDN, R2, Workers, D1)
These compress time dramatically. Scale or cost will eventually push you down to IaaS, but not on day one.
4.2 Containers and orchestration
Docker is effectively required. Kubernetes should be delayed as long as possible. ECS, Cloud Run, App Runner, Fly Machines are plenty early on.
Signals that you need K8s:
- Services exceed ~10
- Team exceeds ~20
- Enterprise customers request “Helm chart for on-prem”
4.3 CI/CD
The baseline pipeline:
PR created → lint → unit/integration tests → build → preview deploy → review → merge
merged → staging deploy → E2E tests → (manual/auto) production deploy
Deployment strategies:
- Rolling: Most common. Zero downtime.
- Blue-Green: Switch between two environments. Fast rollback.
- Canary: 1% → 10% → 100%. Standard at scale.
- Feature flags: Decouple deploy from release. LaunchDarkly, Unleash, PostHog flags.
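Canary percentages and feature flags both come down to deciding, deterministically, who is in the first N%. A minimal sketch of hash-based bucketing, assuming rollout is decided per tenant; the function names are hypothetical and hosted flag services implement their own variants of this idea:

```python
import hashlib


def rollout_bucket(flag: str, tenant_id: str) -> int:
    """Deterministically map (flag, tenant) to a bucket in 0..99."""
    digest = hashlib.sha256(f"{flag}:{tenant_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag: str, tenant_id: str, rollout_percent: int) -> bool:
    """A tenant is in the rollout when its bucket falls below the percentage."""
    return rollout_bucket(flag, tenant_id) < rollout_percent
```

Because the bucket depends only on the flag and tenant, raising the percentage from 1 to 10 to 100 only ever adds tenants; nobody who already has the feature loses it mid-rollout. Including the flag name in the hash keeps the same tenants from always being first for every feature.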
4.4 Observability (the three pillars)
Logs. Structured JSON. tenant_id, user_id, request_id mandatory on every line. Datadog, Grafana Loki, CloudWatch.
Metrics. System (CPU, memory, DB connections), application (request count, latency, error rate), business (signups, payments, MRR). Prometheus + Grafana or Datadog. A side-by-side comparison is in Datadog vs Grafana vs New Relic.
Traces. How a request flowed through services. OpenTelemetry is the standard. Datadog APM, Tempo, Honeycomb.
Error tracking: Sentry, effectively standard.
Uptime: Better Stack, Pingdom, UptimeRobot.
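For the logging requirement above (structured JSON with tenant_id, user_id, and request_id on every line), a minimal sketch with the standard `logging` module follows. `JsonFormatter` and `build_logger` are hypothetical names; in a real service the three IDs would usually be injected by request middleware and not passed by hand.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the mandatory context fields."""

    REQUIRED = ("tenant_id", "user_id", "request_id")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        for field in self.REQUIRED:
            # Missing context is logged as null so gaps are visible, not silent.
            payload[field] = getattr(record, field, None)
        return json.dumps(payload)


def build_logger(name: str = "app") -> logging.Logger:
    """Create a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

Usage is the standard `extra` mechanism: `build_logger().info("invoice paid", extra={"tenant_id": "t_1", "user_id": "u_1", "request_id": "r_1"})`.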
4.5 Security hardening
- All infra as Infrastructure as Code (Terraform, Pulumi).
- VPC isolation, minimize public subnets.
- Security groups / firewalls by least privilege.
- DB never public.
- IAM roles by least privilege; avoid long-lived access keys.
- Secrets in Vault / Secrets Manager.
- WAF: Cloudflare, AWS WAF.
- DDoS defense at the CDN layer.
- Dependency scanning: Dependabot, Renovate.
- Container image scanning: Trivy, Grype.
4.6 Backup and disaster recovery (DR)
RPO (Recovery Point Objective): How much data loss is acceptable (e.g. 15 min). RTO (Recovery Time Objective): How long to restore (e.g. 1 hour).
- Automated DB backups + Point-in-time Recovery (PITR) for at least 7 days, typically 30.
- Backups replicated to a separate region/account.
- Runbooks per scenario: “primary DB failure,” “entire region down,” “accidental deletion.”
5. Launch
5.1 Beta and staged rollout
- Closed Alpha: Team + 5–10 friendly customers. Breakage is fine.
- Closed Beta: 50–200 invite-only. Real data, real usage. Active feedback channels.
- Open Beta / Early Access: Public waitlist, coupons.
- GA (General Availability): Official launch.
Define exit criteria for each stage — “when this is true, we advance.”
5.2 Onboarding design
The goal is to compress Time-to-Value.
- Sign-up with as few fields as possible, social login included.
- Sample data (sample project) on first login.
- Checklist-style onboarding (Linear, Notion).
- Clear next action from every empty state.
- Measure and shrink “time to Aha.”
If you’re going Product-Led Growth (PLG), onboarding quality dominates conversion.
5.3 Go-to-Market
- PLG: Free → Pro → Team. Slack, Notion, Figma, Linear. Suits SMB / developer-centric SaaS.
- SLG (Sales-Led Growth): Outbound + inbound → demo → POC → contract. Enterprise / high-ACV products.
- Hybrid: PLG for breadth, SLG for enterprise expansion. Most modern B2B SaaS.
Early marketing channels:
- Content SEO (blog, comparison pages, alternatives pages)
- Communities (Reddit, HN, Dev.to, LinkedIn)
- Product Hunt
- Partnerships (integration directories)
- Founder personal brand
5.4 Support and CS
- Founders do support in the early days. No exceptions. It’s the densest product insight you’ll ever get.
- Tools: Intercom, HelpScout, Zendesk, Plain, Front.
- Knowledge base: start early. The third time a question comes up, document it.
- SLA per plan (Free: 48h, Pro: 12h, Enterprise: 2h).
6. Operations
6.1 Reliability (SRE)
- SLI (Service Level Indicator): What you measure (availability, latency).
- SLO (Service Level Objective): Internal target (availability 99.95%).
- SLA (Service Level Agreement): Contractual promise (usually a little looser than SLO).
Error Budget: 100% - SLO is your permissible failure budget. Burn it and you stop shipping new features and invest in reliability. This is the core Google SRE idea.
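The error budget is easy to compute once the SLO is expressed over something countable. A minimal sketch, assuming a request-based SLO (share of successful requests); a time-based SLO would work the same way with minutes in place of requests, and the function name is hypothetical:

```python
def error_budget_remaining(slo_percent: float, total_requests: int,
                           failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    allowed_failures = total_requests * (1 - slo_percent / 100)
    if allowed_failures == 0:
        raise ValueError("a 100% SLO or zero traffic leaves no budget to measure")
    return 1 - failed_requests / allowed_failures
```

With a 99.95% SLO and 1,000,000 requests in the window, the budget is about 500 failed requests, so 125 failures leaves roughly 75% of the budget. A negative result is the signal to pause feature work and invest in reliability.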
6.2 Incident management
Outages happen. Preparation is everything.
- Detection: Alerts, customer reports.
- Severity classification: Sev1 (total outage), Sev2 (major feature), Sev3 (partial), Sev4 (minor).
- Response (on-call): PagerDuty, Opsgenie, Better Stack. Rotations.
- Communication: Real-time status page (Statuspage, Instatus). Separate emails for enterprise customers.
- Recovery.
- Postmortem: Blameless. Fix the system, not the person. Publish root cause and the mitigations.
6.3 Release management
- Default to daily or weekly releases. “No Friday deploys” is tradition, but with reliable CI/CD and feature flags it’s not a hard rule.
- Decouple deploy from launch with feature flags. Code is deployed; the feature is flipped on gradually.
- Minimize breaking changes; when unavoidable, announce, add deprecation windows, and write a migration guide.
- Public changelogs (Linear, Headwayapp).
6.4 Data operations
- Query performance monitoring: slow query log, pg_stat_statements.
- Periodic index review.
- Track data growth and project capacity.
- Archive strategy (old data into cold storage).
- Migrate back-to-front: shadow write → backfill → read switch → drop old table.
6.5 Finance ops
- MRR (Monthly Recurring Revenue): The SaaS vital sign.
- ARR: MRR × 12.
- Churn rate: Monthly under 1% is healthy for SMB; enterprise is under 5% annual.
- Net Revenue Retention (NRR): Revenue retained from existing customers after upsell, downgrade, and churn. Above 100% means the product grows organically from existing customers. 120%+ is top-tier.
- Gross Margin: 70–80% is the SaaS norm. Lower means infra or external API (OpenAI) costs need to be restructured.
- CAC and LTV. The usual benchmarks are LTV/CAC > 3 and CAC Payback < 12 months.
- Rule of 40: Growth rate + operating margin ≥ 40. The public-SaaS health bar.
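The metrics above reduce to a handful of small formulas. A minimal sketch follows; the ARR definition comes from the list, while the NRR, LTV, and CAC payback formulas are common simplifications assumed here (the guide does not define them), and all function names are hypothetical:

```python
def arr(mrr: float) -> float:
    """Annual recurring revenue from monthly recurring revenue."""
    return mrr * 12


def nrr_percent(starting_mrr: float, expansion: float,
                contraction: float, churned: float) -> float:
    """Net revenue retention for a cohort of existing customers, in percent."""
    return (starting_mrr + expansion - contraction - churned) / starting_mrr * 100


def ltv(arpa_monthly: float, gross_margin: float, monthly_churn: float) -> float:
    """Simplified LTV: margin-adjusted monthly revenue over expected lifetime (1/churn)."""
    return arpa_monthly * gross_margin / monthly_churn


def cac_payback_months(cac: float, arpa_monthly: float, gross_margin: float) -> float:
    """Months of margin-adjusted revenue needed to earn back acquisition cost."""
    return cac / (arpa_monthly * gross_margin)


def passes_rule_of_40(growth_percent: float, operating_margin_percent: float) -> bool:
    """Rule of 40: growth rate plus operating margin is at least 40."""
    return growth_percent + operating_margin_percent >= 40
```

As a worked example: $500 average monthly revenue per account, 80% gross margin, and 2% monthly churn give an LTV of about $20,000. With a $4,000 CAC, that is an LTV/CAC of about 5 and a payback of 10 months, inside both benchmarks.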
7. Growth and expansion
7.1 Data-driven decisions
- Product analytics: PostHog, Amplitude, Mixpanel. Funnels, retention, cohorts.
- Session replay: PostHog, Fullstory, Hotjar.
- Data warehouse: BigQuery, Snowflake, Redshift. Don’t query production for analysis.
- ELT: Fivetran, Airbyte replicate operational DB → DW. dbt models.
- BI: Metabase, Looker, Mode.
Internalize a “hypothesis → experiment (A/B) → measure → learn” loop. Weekly experiment review.
7.2 Retention and expansion
Retaining a customer is typically 5–10× cheaper than acquiring one.
- Activation: Drive the key action in the first 24–72 hours.
- Habit loop: Drive weekly return usage.
- In-app messaging: Prompts when key features sit unused.
- Churn prediction: Declining usage, shrinking team, payment failure as signals.
- Expansion: Seats, feature upsells, plan upgrades.
7.3 Going enterprise
When enterprise customers start to appear, you’ll need:
- SSO (SAML, OIDC)
- SCIM (auto-provisioning)
- Audit log API
- Granular custom roles
- Data residency (EU data in EU)
- Custom MSA, DPA
- SOC 2, ISO 27001 certificates
- Dedicated CSM (Customer Success Manager)
- 99.9%+ SLA
- Sandbox / Staging environments
That package is the “Enterprise Plan.”
7.4 Internationalization
- i18n: No hardcoded strings from day one. Translation keys.
- l10n: Not just translation — date/currency/number formats, RTL.
- Billing currency and tax (VAT, GST).
- Regional infrastructure: US / EU / APAC regions.
- Legal: Privacy, tax, local entity.
8. Organization and culture
8.1 Early team composition
Typical 0–10 person makeup:
- 2–3 co-founders (product, tech, sales — one each)
- 2–4 full-stack engineers
- 1 designer
- 1 support/CS (often a founder early on)
Roles added between 10–30:
- Product Manager
- DevOps / SRE
- Sales, marketing
- Data analyst
8.2 Development process
- 2-week sprints or Kanban. Early-stage startups often flex better with Kanban.
- Weekly product review: Metrics and learnings.
- Writing culture: RFCs / design docs. Shared by Stripe, Amazon, GitLab.
- On-call rotation: Every engineer owns the operational consequences of their code.
9. Stage-by-stage checklist
Ideation
- Problem validated (30+ interviews)
- ICP defined
- 3 pricing models compared, one chosen
- MVP scope and No-GO list set
Design
- Multi-tenancy model decided
- Non-functional requirements documented
- Data model and API designed
- Security and compliance targets set
Development
- Auth and billing in place
- Tenant isolation covered by automated tests
- Observability instrumented
- Test automation pipeline running
Infrastructure
- All envs managed as IaC
- CI/CD pipeline live
- Logs, metrics, traces, error tracking connected
- Backup + recovery rehearsal completed
Launch
- Onboarding flow tuned
- Status page live
- Support channels and SLAs defined
- Coupon/trial policy set
Operations
- On-call and incident process
- Weekly SLO review
- Postmortem template
- Weekly MRR/Churn/NRR report
Growth
- Product analytics and DW online
- Experiment process running
- Retention program live
- Enterprise package (SSO, SCIM, audit log, security certs)
Closing
There’s no such thing as a perfect SaaS. What separates the teams that survive is the habit of regularly asking themselves, at each of these stages, “are we doing this well right now?”