Design Mint.com
Use case: service extracts transactions from account - What is the data flow?
- Client sends request to web server
- Web server forwards request to Accounts API server
- Accounts API server places a job on a Queue, e.g. Amazon SQS/RabbitMQ
    - Extracting transactions can take a while, so we want to do this asynchronously with a queue, although this introduces additional complexity
- Transaction Extraction Service does the following:
    - Pulls from the Queue and extracts transactions for the given account, storing results as raw log files in the Object Store
    - Uses the Category Service to categorize each transaction
    - Uses the Budget Service to calculate aggregate monthly spending by category
        - The Budget Service uses the Notification Service to let users know if they are nearing or have exceeded their budget
    - Updates the SQL DB "transactions" table with categorized transactions and the "monthly_spending" table with aggregate monthly spending by category
    - Notifies the user that transactions have completed through the Notification Service, which uses a queue to asynchronously send out notifications
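The queue hand-off above can be sketched with an in-memory queue standing in for Amazon SQS/RabbitMQ; the function names and job shape are hypothetical:

```python
import queue

# In-memory stand-in for the Amazon SQS/RabbitMQ queue (sketch only)
job_queue = queue.Queue()

def enqueue_extraction(account_id):
    """Accounts API server places an extraction job on the queue."""
    job_queue.put({"type": "extract_transactions", "account_id": account_id})

def process_one_job():
    """Transaction Extraction Service pulls one job off the queue.

    A real worker would then: extract transactions to raw log files in the
    Object Store, categorize via the Category Service, aggregate via the
    Budget Service, update the SQL tables, and notify the user.
    """
    return job_queue.get()

enqueue_extraction(account_id=42)
job = process_one_job()
```

The API server returns immediately after the `put`; the slow extraction work happens on the worker's schedule, which is the complexity/latency tradeoff noted above.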
Use case: user connects to financial account
- Client sends request to the web server, running as a reverse proxy
- Web server forwards the request to the Accounts API server
- Accounts API server updates the SQL DB "accounts" table with the new info
- Can create an index on "id", "user_id", and "created_at" to speed up lookups (log time) and to keep the data in memory
- Can use a REST API for clients and RPC for internal comms
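A minimal sketch of the indexed lookup, using an in-memory sqlite3 database as a stand-in for the SQL DB; the schema and index name are illustrative ("id" is already indexed as the primary key):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE accounts (
        id INTEGER PRIMARY KEY,   -- primary key is indexed automatically
        user_id INTEGER,
        created_at TEXT
    )
""")
# Index on the lookup columns, as described above (log-time lookups)
conn.execute("CREATE INDEX idx_accounts_user ON accounts (user_id, created_at)")

conn.execute("INSERT INTO accounts (user_id, created_at) VALUES (?, ?)",
             (1, "2024-01-01"))
rows = conn.execute("SELECT id FROM accounts WHERE user_id = ?", (1,)).fetchall()
```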
Mint Assumptions
- Traffic not evenly distributed
- Automatic daily update of accounts applied only to users active in the past 30 days
- Adding/removing financial accounts is rare
- Budget notifications don't need to be instant
- Write-heavy system
- Need to pull data from banks (rather than banks pushing)
Mint Use cases
1) User connects to a financial account
2) Service extracts transactions from the account
    - Updates daily
    - Categorizes transactions
        - Allows manual category override by the user
        - No automatic re-categorization
    - Analyzes monthly spending, by category
3) Service recommends a budget
    - Allows users to manually set a budget
    - Sends notifications when approaching or exceeding the budget
4) Service has high availability
Mint numbers/constraints
- 10M active users
- 10 budget categories/user = 100M budget items
- 30M financial accounts
- 5B transactions/month
- 500M read requests/month
- 10:1 write:read ratio
How to handle 2,000 average transaction writes/second
2,000 average transaction writes per second (higher at peak) might be tough for a single SQL Write Master-Slave. We might need to employ additional SQL scaling patterns, e.g. Federation, Sharding, Denormalization, and SQL Tuning. We should also consider moving some data to a NoSQL DB.
Use case: service recommends a budget - better attempt
As an alternative, we could run MapReduce jobs on the raw transaction files to:
1) Categorize each transaction
2) Generate aggregate monthly spending by category
- Running analyses on the raw transaction files could significantly reduce the load on the DB
- Could call the Budget Service to re-run the analysis if the user updates a category
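A toy single-process version of the map and reduce phases, under an assumed raw log line format (`user_id,date,seller,amount`) and a small seed dictionary; all data here is hypothetical:

```python
from collections import defaultdict

RAW_LOGS = [
    "1,2024-01-05,starbucks,4.50",
    "1,2024-01-09,starbucks,5.00",
    "1,2024-01-12,amazon,30.00",
]
SELLER_TO_CATEGORY = {"starbucks": "coffee", "amazon": "shopping"}  # assumed seed

def map_phase(line):
    """Emit ((user, month, category), amount) pairs from one raw log line."""
    user_id, date, seller, amount = line.split(",")
    month = date[:7]  # e.g. "2024-01"
    category = SELLER_TO_CATEGORY.get(seller, "other")
    yield (user_id, month, category), float(amount)

def reduce_phase(pairs):
    """Sum amounts per (user, month, category) key."""
    totals = defaultdict(float)
    for key, amount in pairs:
        totals[key] += amount
    return dict(totals)

monthly_spending = reduce_phase(kv for line in RAW_LOGS for kv in map_phase(line))
```

A real MapReduce framework distributes the map over many raw files and shuffles by key before the reduce, but the per-record logic is the same shape.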
Draw the major components for a naive system that pulls bank data on each login, categorizes all the data, and analyzes the user's budget.
Bank data is pulled periodically (depending on user behavior). Once new data arrives, it is stored as raw, unprocessed transactions. The data is then pushed to the categorizer, which assigns each transaction to a category and stores the categorized transactions in another datastore. The budget analyzer pulls in categorized transactions, updates each user's budget per category, and stores the user's budget. The frontend pulls data from both the categorized-transactions datastore and the budget datastore. The user can interact with the frontend by changing the budget or the categories of transactions.
Scaling design details - cache related
- Can refer to 'when to update cache' for tradeoffs and alternatives, e.g. cache-aside
- Instead of keeping the monthly_spending aggregate table in the SQL DB, we could create a separate Analytics DB using a data warehousing solution, e.g. Amazon Redshift or Google BigQuery
- Might only want to store a month of transaction data in the DB, while storing the rest in a data warehouse or in an Object Store
- An Object Store (e.g. Amazon S3) can comfortably handle the constraint of 250 GB of new content/month
- To address the 200 average read requests/second (higher at peak), traffic for popular content should be handled by the Memory Cache instead of the DB. The Memory Cache is also useful for handling the unevenly distributed traffic and traffic spikes
- SQL Read Replicas should be able to handle the cache misses, as long as the replicas are not bogged down with replicating writes
How does categorizer work?
The category for a seller might be stored in a cache for common sellers, and that category is then applied to all of that seller's transactions. After applying categories, the categorizer re-groups the transactions by user, and the grouped transactions are inserted into the datastore for each user.
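A sketch of the categorize-then-regroup step, with a plain dict standing in for the seller cache (all data and names are hypothetical):

```python
from collections import defaultdict

# Assumed cache of categories for common sellers
seller_category_cache = {"target": "shopping", "shell": "gas"}

def categorize_and_group(transactions):
    """Apply cached seller categories, then re-group results by user."""
    by_user = defaultdict(list)
    for txn in transactions:
        txn["category"] = seller_category_cache.get(txn["seller"], "uncategorized")
        by_user[txn["user_id"]].append(txn)
    return by_user

grouped = categorize_and_group([
    {"user_id": "u1", "seller": "target", "amount": 25.0},
    {"user_id": "u2", "seller": "shell", "amount": 40.0},
    {"user_id": "u1", "seller": "corner deli", "amount": 8.0},
])
```

Grouping by user before the insert means each user's rows land in one batch write, which matters once the datastore is sharded by user.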
Issue 3: Massive amount of data that needs to be pulled and analyzed
Fetch bank data asynchronously and run tasks across many servers.
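One way to fan the fetches out across workers instead of pulling accounts serially, sketched with a thread pool and a placeholder fetch function (the function and its return value are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_bank_data(account_id):
    """Placeholder for pulling one account's transactions from a bank."""
    return f"raw_log_{account_id}"

# Fan the (I/O-bound) fetches out across a pool of workers;
# results come back in submission order.
with ThreadPoolExecutor(max_workers=8) as pool:
    logs = list(pool.map(fetch_bank_data, [1, 2, 3]))
```

In the real system the "workers" would be many servers consuming from the task queue, not threads in one process, but the fan-out idea is the same.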
Category service
For the Category Service, we can seed a seller-to-category dictionary with the most popular sellers. If we estimate 50K sellers and estimate each entry to take <255 bytes, the dictionary would only take about 12 MB of memory. For sellers not initially seeded in the map, we could use a crowdsourcing effort by evaluating the manual category overrides our users provide. We could use a heap to quickly look up the top manual override per seller in O(1) time.
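The crowdsourcing idea can be sketched as vote counts per seller, with a heap used to surface the top category (data and names are hypothetical; a maintained heap gives an O(1) peek, and heapifying on each call here is just for sketch simplicity):

```python
import heapq
from collections import defaultdict

# Crowdsourced overrides: seller -> {category: vote count}
override_votes = defaultdict(lambda: defaultdict(int))

def record_override(seller, category):
    """A user's manual re-categorization counts as a vote for that seller."""
    override_votes[seller][category] += 1

def top_category(seller):
    """Return the most-voted category for a seller, or None if no votes."""
    heap = [(-count, category) for category, count in override_votes[seller].items()]
    heapq.heapify(heap)
    return heap[0][1] if heap else None
```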
How to store transactions?
Might be better to store raw transactions as text files. Can group transaction files by the seller's name and take advantage of the duplicate seller entries.
Usage back of envelope
- Size per transaction = ~50 bytes (user ID, seller, amount)
- 5 billion transactions/month * 50 bytes = 250 GB of new transaction content/month
- 250 GB/month * 36 months = 9 TB over 3 years
- ~2K transaction writes/second on average, and with the 10:1 write:read ratio, ~200 read requests/second
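The arithmetic above can be checked in a few lines (using a 30-day month as the approximation):

```python
# Back-of-the-envelope check of the numbers in the constraints
bytes_per_transaction = 50
transactions_per_month = 5_000_000_000
reads_per_month = 500_000_000

new_content_per_month_gb = bytes_per_transaction * transactions_per_month / 1e9  # 250 GB
three_years_tb = new_content_per_month_gb * 36 / 1000                            # 9 TB

seconds_per_month = 30 * 24 * 3600            # ~2.6M seconds
writes_per_second = transactions_per_month / seconds_per_month  # ~1,900 -> call it 2K
reads_per_second = reads_per_month / seconds_per_month          # ~190  -> call it 200
```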
What does budget analyzer do?
Takes the data grouped by user, merges it across categories, and then updates the budget. Most of these tasks are handled with simple log files; only the final data (categorized transactions and budget analysis) is stored in the DB, minimizing writing to and reading from the DB.
Issue 1: Data-heavy system so make processing as asynchronous as possible
Task queues handle tasks like pulling in new bank data, reanalyzing budgets, and categorizing bank data, as well as retrying tasks that failed. The task queue system can take task priorities into account.
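A toy priority task queue using a binary heap; the numeric priorities and task names are assumed (lower number = higher priority):

```python
import heapq

task_heap = []  # in-process stand-in for a priority-aware task queue

def enqueue_task(priority, name):
    """Push a task; the heap keeps the lowest-priority-number task on top."""
    heapq.heappush(task_heap, (priority, name))

enqueue_task(2, "reanalyze_budget")
enqueue_task(1, "pull_bank_data")       # fresh bank pulls jump ahead
enqueue_task(3, "categorize_backfill")
```

A failed task can be re-enqueued with the same call, optionally at a lower priority, which covers the retry behavior mentioned above.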
Use case: service recommends a budget - 1st attempt
To start, we could use a generic budget template that allocates category amounts based on income tiers. With this approach, we would not have to store the 100M budget items identified in the constraints, only those that the user overrides. If a user overrides a budget category, we could store the override in a budget_overrides table. For the Budget Service, we can potentially run SQL queries on the transactions table to generate the monthly_spending aggregate table. The monthly_spending table is likely to have far fewer rows than the total 5B transactions, since users typically have many transactions per month.
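A minimal sketch of the template-plus-overrides idea; the tier percentages, income cutoff, and the dict standing in for the budget_overrides table are all hypothetical:

```python
# Hypothetical income tiers mapping to per-category monthly allocation fractions
BUDGET_TEMPLATES = {
    "low":  {"rent": 0.40, "food": 0.20, "savings": 0.10},
    "high": {"rent": 0.25, "food": 0.10, "savings": 0.30},
}

# Only user overrides are stored, not all 100M budget items
budget_overrides = {}  # (user_id, category) -> monthly amount

def budget_for(user_id, annual_income, category):
    """Return the user's override if present, else the templated allocation."""
    if (user_id, category) in budget_overrides:
        return budget_overrides[(user_id, category)]
    tier = "low" if annual_income < 50_000 else "high"  # assumed cutoff
    return annual_income / 12 * BUDGET_TEMPLATES[tier].get(category, 0.0)
```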
What if user manually changes categories?
Update the datastore for categorized transactions. The change would trigger a quick recomputation of the budget: decrement the amount from the old category and increment it in the new category.
Scaling design details for user access summaries and transactions
User sessions, aggregate stats by category, and recent transactions could be placed in a Memory Cache, e.g. Redis or Memcached.
- Client sends a read request to the Web Server
- Web Server forwards the request to the Read API server
    - Static content can be served from the Object Store, e.g. S3, which is cached on the CDN
- Read API server does the following:
    - Checks the Memory Cache for the content
        - If the content is in the Memory Cache, returns the cached contents
        - Else, fetches the contents from the SQL DB and updates the Memory Cache
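The read path above is the cache-aside pattern; a sketch with plain dicts standing in for Redis/Memcached and the SQL DB (keys and data are hypothetical):

```python
cache = {}                                             # stand-in for Redis/Memcached
sql_db = {"txn:1": {"seller": "exxon", "amount": 10.0}}  # stand-in for the SQL DB

def read(key):
    """Cache-aside read: check the Memory Cache first, fall back to the DB."""
    if key in cache:
        return cache[key]       # cache hit
    value = sql_db.get(key)     # cache miss: fetch from the SQL DB
    if value is not None:
        cache[key] = value      # populate the Memory Cache for next time
    return value
```

A production version would add a TTL and an eviction policy (e.g. LRU) so the cache holds only the hot, recent data.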
Issue 2: An email system that regularly crawls users' data to check whether they're exceeding their budget would mean checking every single user daily.
Want to queue a task whenever a transaction occurs that potentially exceeds a budget. Can store current budget totals by category to make it easy to determine whether a new transaction exceeds the budget.
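A sketch of the per-transaction check against hot running totals; the data, the budget figure, and the notify callback are all hypothetical:

```python
# Running totals per (user, category) kept hot so each new transaction
# can be checked without a full daily re-analysis
current_totals = {}
budgets = {("u1", "food"): 300.0}  # assumed monthly budget

def on_transaction(user_id, category, amount, notify):
    """Update the running total and queue a notification if over budget."""
    key = (user_id, category)
    current_totals[key] = current_totals.get(key, 0.0) + amount
    if key in budgets and current_totals[key] > budgets[key]:
        notify(user_id, category)  # would enqueue a notification task
```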
Use case: service extracts transactions from account - When?
We'll want to extract information from an account in these cases:
- User first links the account
- User manually refreshes the account
- Automatically each day for users who have been active in the past 30 days