DEV Community: Raja Upadhyay

Storage and Retrieval

Raja Upadhyay — Tue, 21 Jun 2022 06:31:35 +0000

Indexes are used in databases to speed up reads. Any kind of index usually slows down writes, because the index also needs to be updated every time data is written.

A clustered index stores all row data within the index.
A non-clustered index stores only references to the row data, which lives elsewhere (e.g. in a heap file).
A covering index stores some of a table's columns within the index. This allows some queries to be answered using the index alone.
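As a rough illustration of the covering case, the sketch below uses Python's built-in sqlite3 module. The table, column and index names are made up for this example, and the wording of the query plan is SQLite's own and may differ between versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table, used only for illustration.
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT, bio TEXT)"
)
conn.execute(
    "INSERT INTO users (email, name, bio) VALUES ('a@example.com', 'Ada', 'long text')"
)

# The index holds email and name, so a query that needs only those
# two columns never has to touch the table itself.
conn.execute("CREATE INDEX idx_users_email_name ON users (email, name)")


def plan(sql):
    """Return SQLite's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column is the detail text


covered = plan("SELECT name FROM users WHERE email = 'a@example.com'")
not_covered = plan("SELECT bio FROM users WHERE email = 'a@example.com'")

print(covered)      # plan mentions a covering index
print(not_covered)  # plan uses the index, then fetches bio from the table
```

The second query still benefits from the index for the lookup, but bio is not stored in the index, so the row itself has to be read as well.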

Indexes duplicate data and although they speed up reads, they require additional storage and can add overhead on writes.

Multi-column indexes enable efficient querying of multiple columns at once, e.g. a two-dimensional index on (date, temp) can be used to efficiently find all observations in 2013 where the temp was between 25 and 30 degrees. With a 1D index, you would have to either scan all records from 2013 and then filter by temp, or vice versa.
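The difference can be sketched in plain Python by counting how many records each approach has to examine. The observations are invented, and the grid of (year, temperature bucket) cells is only a simple stand-in for a real multi-dimensional index such as an R-tree.

```python
import bisect
from collections import defaultdict

# Hypothetical observations: (year, temperature in degrees).
observations = [(2012, 27), (2013, 10), (2013, 26), (2013, 29), (2013, 35), (2014, 28)]

# 1D index: records sorted by year, searched with binary search.
by_year = sorted(observations)


def query_1d(year, lo, hi):
    """Find the year via the index, then filter every record in that year."""
    start = bisect.bisect_left(by_year, (year, float("-inf")))
    end = bisect.bisect_right(by_year, (year, float("inf")))
    candidates = by_year[start:end]
    hits = [obs for obs in candidates if lo <= obs[1] <= hi]
    return hits, len(candidates)


# 2D stand-in: a grid keyed by (year, temperature bucket of 5 degrees).
CELL = 5
grid = defaultdict(list)
for year, temp in observations:
    grid[(year, temp // CELL)].append((year, temp))


def query_2d(year, lo, hi):
    """Visit only the grid cells that overlap the requested range."""
    candidates = []
    for bucket in range(lo // CELL, hi // CELL + 1):
        candidates.extend(grid.get((year, bucket), []))
    hits = sorted(obs for obs in candidates if lo <= obs[1] <= hi)
    return hits, len(candidates)


hits_1d, examined_1d = query_1d(2013, 25, 30)
hits_2d, examined_2d = query_2d(2013, 25, 30)
print(hits_1d, examined_1d)  # [(2013, 26), (2013, 29)] 4
print(hits_2d, examined_2d)  # [(2013, 26), (2013, 29)] 2
```

Both queries return the same observations, but the 1D version has to look at every record from 2013, while the grid only visits cells near the requested temperature range.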

In-memory databases

The performance advantage of in-memory databases is not mainly due to the fact that they don't need to read from disk. Even a disk-based storage engine may never need to read from disk if there is enough memory, since the OS caches recently used disk blocks in memory anyway. Rather, the advantage comes from not having to encode in-memory data structures in a format that can be written to disk.

OLTP vs OLAP access pattern

OLTP (Online Transaction Processing) is an access pattern that is widely used in interactive applications where records are inserted/updated based on user input e.g. actions in a game.

OLAP (Online Analytical Processing) is an access pattern used for analytic queries and business intelligence.
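The two access patterns can be contrasted with a small sqlite3 sketch. The purchases table and its rows are invented for this example. In practice analytic queries usually run against a separate data warehouse, not the transactional database itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table of in-game purchases.
conn.execute(
    "CREATE TABLE purchases (id INTEGER PRIMARY KEY, player TEXT, item TEXT, price REAL)"
)
conn.executemany(
    "INSERT INTO purchases (player, item, price) VALUES (?, ?, ?)",
    [("ann", "sword", 10.0), ("bob", "shield", 7.5), ("ann", "potion", 2.5)],
)

# OLTP-style: touch a small number of records, looked up by key.
conn.execute("UPDATE purchases SET price = 9.0 WHERE id = 1")
row = conn.execute(
    "SELECT player, item, price FROM purchases WHERE id = 1"
).fetchone()

# OLAP-style: scan many records and aggregate over them.
totals = conn.execute(
    "SELECT player, SUM(price) FROM purchases GROUP BY player ORDER BY player"
).fetchall()

print(row)     # ('ann', 'sword', 9.0)
print(totals)  # [('ann', 11.5), ('bob', 7.5)]
```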

Data Models and Query Languages

Raja Upadhyay — Mon, 09 May 2022 20:07:10 +0000

In complex applications, APIs may be built upon APIs, with each layer hiding the complexity of the layers below it by providing a clean data model.

One of the best-known data models today is that of SQL, based on the relational model.

Most application development today is done in object-oriented programming languages, so a translation layer is sometimes required between the objects in the application code and the database model of tables, rows and columns. This disconnect is known as the impedance mismatch, and it is where object-relational mapping (ORM) is used (e.g. SQLAlchemy in Python).
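To make the mismatch concrete, here is a minimal hand-written translation layer using only the standard library. The User class, the two tables and the save/load helpers are hypothetical. An ORM such as SQLAlchemy automates this kind of mapping; its API is not shown here.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class User:
    """Application-side object: one user with a nested list of job titles."""
    name: str
    positions: list = field(default_factory=list)


conn = sqlite3.connect(":memory:")
# Database-side model: the same information split across two tables.
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE positions (user_id INTEGER, title TEXT)")


def save(user):
    """Flatten the object into rows in both tables and return the new user id."""
    cur = conn.execute("INSERT INTO users (name) VALUES (?)", (user.name,))
    user_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO positions (user_id, title) VALUES (?, ?)",
        [(user_id, title) for title in user.positions],
    )
    return user_id


def load(user_id):
    """Reassemble the object from rows in both tables."""
    (name,) = conn.execute(
        "SELECT name FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    titles = [
        title
        for (title,) in conn.execute(
            "SELECT title FROM positions WHERE user_id = ? ORDER BY rowid",
            (user_id,),
        )
    ]
    return User(name=name, positions=titles)


uid = save(User("Ada", ["Engineer", "Manager"]))
restored = load(uid)
print(restored)  # User(name='Ada', positions=['Engineer', 'Manager'])
```

The save and load functions are the translation layer: every nested list on the object becomes a separate table plus a query to put it back together.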

Using document stores or JSON representation can sometimes help reduce the impedance mismatch but this data model also has its disadvantages (to be explored further in later chapters).

JSON representation has better locality than multi-table relational models. e.g. to obtain a piece of data you might need to perform multi-way joins in a relational database whereas in a document model, all the relevant information is in one place.
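A small sketch of that locality, using a plain dict as a stand-in for a document store. The key format and the profile fields are assumptions made for this example.

```python
import json

# Hypothetical document store: one key maps to one self-contained JSON document.
store = {}
store["user:1"] = json.dumps(
    {
        "name": "Ada",
        "positions": [{"title": "Engineer"}, {"title": "Manager"}],
        "education": [{"school": "Example University"}],
    }
)

# A single lookup returns the whole profile; no joins are needed.
profile = json.loads(store["user:1"])
titles = [position["title"] for position in profile["positions"]]

print(titles)  # ['Engineer', 'Manager']
```

Note that the whole document is decoded even though only the titles are used here.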

Relational vs Document Databases

Document databases offer schema flexibility and better performance due to locality. Relational databases allow you to perform joins and support many-to-one and many-to-many relationships.

If your application often needs to access entire documents then the locality of document databases can be advantageous. If data is split across multiple tables, multiple index lookups and joins can require more disk seeks and therefore add time.

On the flip side, document databases typically load the entire document even if you only want to access a small portion of it.

NB: Locality is not a feature offered only by document databases. Google's Spanner database offers locality properties by allowing the schema to declare that a table's rows should be interleaved within a parent table. Other such offerings also exist.

Query Languages for Data

Imperative programming languages (e.g. Python, JS) describe the exact operations and the order of those operations. Declarative languages (e.g. SQL), on the other hand, just describe the data you want, any conditions the results must meet and any transformations (e.g. group by). It is up to the database system's query optimiser to decide how to achieve the result (using joins, indexes etc).

Declarative languages have a better chance of faster parallel execution since they only specify the pattern of results and not the actual algorithm to obtain the results.
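The same query written both ways, as a sketch. The animals data and the get_sharks helper are made up for illustration.

```python
import sqlite3

# Hypothetical data: (name, family).
animals = [("shark", "Sharks"), ("lion", "Cats"), ("whale shark", "Sharks")]


# Imperative: the code spells out the loop, the test and the order of steps.
def get_sharks(rows):
    sharks = []
    for name, family in rows:
        if family == "Sharks":
            sharks.append(name)
    return sharks


# Declarative: the query states which rows are wanted; the database
# decides how to find them (scan, index, execution order).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE animals (name TEXT, family TEXT)")
conn.executemany("INSERT INTO animals VALUES (?, ?)", animals)
declarative = [
    name
    for (name,) in conn.execute(
        "SELECT name FROM animals WHERE family = 'Sharks' ORDER BY name"
    )
]

print(get_sharks(animals))  # ['shark', 'whale shark']
print(declarative)          # ['shark', 'whale shark']
```

Adding an index on family would speed up the SQL version without changing the query text, whereas the loop would have to be rewritten to take advantage of it.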

(summary of chapter 2 from Designing Data-Intensive Applications by Martin Kleppmann)

Reliability, Scalability and Maintainability

Raja Upadhyay — Thu, 24 Mar 2022 19:31:30 +0000

Reliability: The system should continue working correctly even when things go wrong e.g. the system should be able to deal with the user making mistakes.
Scalability: The system should be able to serve its users as the userbase grows.
Maintainability: Many people will work on a given system and this should be as productive as possible.

Load
We can define the load on a system using load parameters and this varies from system to system.
Examples:

Requests per second to a web server
Number of concurrent users
Hit rate on a cache

Latency vs Response Time
Response time is what the client sees: the time taken to process the request (the service time) plus any network and queueing delays. Latency is the duration during which a request is waiting to be handled (awaiting service).
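One way to picture the difference is a single server that handles requests one at a time. The sketch below is a simplified model with invented arrival times and a fixed service time, and it ignores network delays entirely.

```python
def simulate(arrivals, service_time):
    """Single-server FIFO queue.

    Returns one (waiting_time, response_time) pair per request, where
    waiting_time is how long the request sat in the queue and
    response_time is waiting_time plus the time spent processing it.
    """
    free_at = 0  # time at which the server finishes its current request
    results = []
    for arrival in arrivals:
        start = max(arrival, free_at)
        finish = start + service_time
        results.append((start - arrival, finish - arrival))
        free_at = finish
    return results


# Three requests arrive one second apart; each takes 5 seconds to process.
timings = simulate(arrivals=[0, 1, 2], service_time=5)
print(timings)  # [(0, 5), (4, 9), (8, 13)]
```

Every request needs the same 5 seconds of processing, yet the third one waits 8 seconds before it is even picked up, so its response time is 13 seconds.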

When discussing response times, percentiles are more useful than the average, because the average does not tell you how many users actually experienced that delay. High percentiles of response times (e.g. the 99.9th percentile), aka tail latencies, are important as they directly affect users' experience of the service.
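A small worked example with invented response times shows why the average can mislead. The percentile helper uses the nearest-rank method, which is one of several common definitions.

```python
import math


def percentile(sorted_values, p):
    """Nearest-rank percentile: the smallest value such that at least
    p percent of the samples are less than or equal to it."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


# Hypothetical sample: 95 fast requests, 4 slow ones, 1 very slow one (in ms).
response_times_ms = sorted([20] * 95 + [100] * 4 + [3000])

mean = sum(response_times_ms) / len(response_times_ms)
p50 = percentile(response_times_ms, 50)
p99 = percentile(response_times_ms, 99)
worst = percentile(response_times_ms, 100)

print(mean)   # 53.0
print(p50)    # 20
print(p99)    # 100
print(worst)  # 3000
```

The mean of 53 ms describes almost nobody: the typical request takes 20 ms, while the slowest one takes 3 seconds.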

Percentiles are often used in SLAs.

Coping with load

Vertical scaling: moving to a more powerful machine
Horizontal scaling: distributing load across multiple machines

Elastic systems add computing resources automatically, typically by scaling horizontally, once an increase in load is detected.
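The core decision an elastic system makes can be sketched as a single function. The function name, its parameters and the capacity figure are all hypothetical; real autoscalers also take cooldown periods, upper limits and other signals into account.

```python
import math


def desired_instances(requests_per_second, capacity_per_instance, min_instances=1):
    """Return how many machines are needed so that none is asked to
    handle more than capacity_per_instance requests per second."""
    needed = math.ceil(requests_per_second / capacity_per_instance)
    return max(min_instances, needed)


# Assume each machine can handle 100 requests per second.
print(desired_instances(50, 100))    # 1
print(desired_instances(250, 100))   # 3
print(desired_instances(1000, 100))  # 10
```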

Maintainability
A maintainable system is one that is easy to operate (easy for ops teams to keep the system running smoothly), simple to understand for new engineers and is easily extensible or evolvable.

(summary of chapter 1 from Designing Data-Intensive Applications by Martin Kleppmann)