Day 25, database

Today is the second day of the Node.js fullstack sprint, but El and I were done with it last night. Hence, we started working on the next sprint – the database sprint.

The last several sprints are built on top of each other. It is amazing. Initially, we built a chat client, with Parse.com as the backend. Then, we build a chat server with Node.js, replacing Parse.com. Then, we persisted our server side chat log to file system. Now, we are going to persist the server side chat log to databases.

It is a big hassle to talk to file system directly. Some drawbacks include:

– have to read / write the whole thing (we could break files into smaller files or try reading the file by index)

– no abstraction layer

– conflicts when the same file is read and written by multiple users at the same time

Hence, we need an abstraction layer on top of the file system. That system needs to:

– deal with tons of files and provide persistent storage.

– allow read from middle of the file and allow random access small piece of info in the middle of the file.

– keep a index of data locations

– data chunks have a fixed size

– provide speedy lookup, ideally in constant time

– support multiple programming language

– accept connection protocols similar to HTTP protocols

We briefly talked about SQL vs NoSQL. I am not very happy with the lecture. I feel like we over simplified the concept of a SQL database and created unnecessary confusion.

The purpose of index in a database:

– I had this faint idea of how index works and now I think I know better.

– Index, similar to the index pages in the back of the book, for example, the word “Long March” can be found in page 33.

– Image we have a table with 100 rows:

id, website, timestamp
1, google.com, day1
2, amazon.com, day2
3, ebay.com, day3
...
100, baidu.com, day5

– without an index, if we need to find out the timestamp for baidu.com, we would have to traverse all records before finding the result. The complexity would be O(n).

– There are multiple steps in using an index.

Step 1 is to build the index. We need to traverse the entire table and build a index. It is O(n) but we only need to do this once.

My native thoughts are index might be similar to a HashTable, hence

myIndex['google.com'] = 1;
myIndex['amazon.com'] = 2;
...
myIndex['baidu.com'] = 100;

Step 2 is to lookup the location from the index. Hopefully index behaves like a HashTable, with constant look up time. The time complexity here would be O(1).

In this case, we are looking for ‘baidu.com’, and it would return a location of 100

Step 3 is to calculate the location of the real record. Assuming each record occupy the same size, the location of “baidu.com” is: 100 * size_of_the_record

Step 4 is to visit the location of “baidu.com” directly, without visiting any of the other records.

Again, my very native understanding of how index would speed up lookups.

Relational vs non-relational databases:

Relational comes from relational algebra, a branch of mathematics dealing with sets. It makes sense since sets and tables are very similar. For example, they are unique.

SQL questions are common in job interviews. Basic syntax is SFW – select * from * where. Common topics are: normalization, schema design and foreign keys.

Types of collections between tables: 1 to 1, 1 to many, and many to many.

It is not a good idea to store arrays in a column because it is hard to index.

ORMs:

ORMs enable people to use SQL, but they might produce bad or slow SQLs.

Mongo:

– technology.

— Schemaless / doc oriented. Good for start-ups who change their schema all the time.

— auto-scaling / shading. However, auto-scaling only works with primary indices. At a large scale, you have to choose auto-scaling or secondary indices. Primary index means index based on the primary key(???).

— won’t tell you if a write fails

– user experience

– fashion

The Twitter dilemma. Natively, there are two tables.

// UserTable
user_id, user_name
1, shao
2, js
3, python

// TweetTable
tweet_id, tweet_text, timestamp, user_id
11, tweet1, time1, 1
12, tweet2, time2, 2
13, tweet3, time3, 1

//FollowTable
id, user_id, following_user
1, 1, 2
1, 1, 3

// to build shao's timeline, we would write
// pretty sure this is NOT the right syntax
SELECT * from TweetTable where user_id in (SELECT following_user from FollowTable where user_id = 1)

The problem is we need to re-build the entire index every time a new Tweet is created and the users are expecting results to be returned at almost real time. Enough for now, but I need to study more on this.

Advertisements
Standard