Databases: Guide For Beginners

DATABASES
A database is a structured collection of related data. Databases allow queries (searches), insertions, updates, deletions, and many other functions. Access to the database is controlled by the Database Management System (DBMS), which enforces database security; the DBMS itself is administered by Database Administrators (DBAs). Databases may be searched with a database query language, such as the Structured Query Language (SQL). Typical database security issues include the confidentiality and integrity of the stored data. Integrity is a primary concern when replicated databases are updated.
Additional database confidentiality issues include inference and aggregation attacks. Aggregation is a mathematical attack where an attacker aggregates details at a lower classification to determine information at a higher classification. Inference is a similar attack, but the attacker must logically deduce missing details: unlike aggregation, a mystery must be solved.


TYPES OF DATABASES
Formal database types include relational (two-dimensional), hierarchical, and object-oriented. The simplest form of database is a flat file: a text file that contains multiple lines of data, each in a standard format. The hosts file (located at /etc/hosts on UNIX systems, and at c:\windows\system32\drivers\etc\hosts on many versions of Microsoft Windows) is an example of a flat file: each entry (line) contains at least an IP address and a host name.
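For example, a minimal hosts file might contain lines such as the following, each mapping an IP address to one or more host names (the second entry is purely illustrative):
  127.0.0.1     localhost
  192.0.2.10    intranet.example.com  intranet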


Relational Databases
The most common modern database is the relational database, which contains two-dimensional tables of related (hence the term “relational”) data. A table is also called a relation. Tables have rows and columns: a row is a database record, called a tuple; a column is called an attribute. A single cell (the intersection of a row and column) in a database is called a value. Relational databases require a unique value called the primary key in each tuple in a table.
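As a brief sketch (the table, column names, and data types here are hypothetical), the following SQL creates such a relation, with each tuple carrying a unique primary key:
  CREATE TABLE employee (
      employee_id  INTEGER PRIMARY KEY,   -- unique, non-null primary key
      first_name   VARCHAR(50),           -- an attribute (column)
      last_name    VARCHAR(50),
      title        VARCHAR(50)
  );
  -- Each row later inserted into this table is a tuple; each individual cell is a value.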

Foreign Keys
A foreign key is a key in a related database table that matches a primary key in a parent database table. The foreign key may also serve as (part of) the local table’s own primary key, but it is called a foreign key because it refers to the parent table’s primary key. The HR database table may be joined to the parent (employee) database table by connecting the foreign key of the HR table to the primary key of the employee table.
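Continuing the hypothetical employee example above, an HR table might reference the employee table with a foreign key, and the two tables can then be joined on that key:
  CREATE TABLE hr_record (
      record_id    INTEGER PRIMARY KEY,
      employee_id  INTEGER REFERENCES employee (employee_id),  -- foreign key to the parent table
      salary       DECIMAL(10,2)
  );
  -- Join the child (HR) table to the parent (employee) table on the key:
  SELECT e.last_name, h.salary
    FROM employee e
    JOIN hr_record h ON h.employee_id = e.employee_id;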
Referential, Semantic, and Entity Integrity
Databases must ensure the integrity of the data in the tables: this is called data integrity, discussed in the “Database Integrity” section below. There are three additional specific integrity issues that must be addressed beyond the correctness of the data itself: referential, semantic, and entity integrity. These are tied closely to the logical operations of the DBMS.
Referential integrity means that every foreign key in a secondary table matches a primary key in the parent table: if this is not true, referential integrity has been broken. Semantic integrity means that each attribute (column) value is consistent with the attribute data type. Entity integrity means each tuple has a unique primary key that is not null.
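Against the hypothetical tables above, a strictly typed DBMS would reject each of the following statements, one per integrity rule:
  -- Referential integrity violation (assuming no employee 99999 exists in the parent table)
  INSERT INTO hr_record (record_id, employee_id, salary) VALUES (1, 99999, 50000.00);
  -- Semantic integrity violation: the value does not match the attribute's data type
  INSERT INTO employee (employee_id, first_name) VALUES ('not a number', 'Alice');
  -- Entity integrity violation: the primary key may not be null
  INSERT INTO employee (employee_id, last_name) VALUES (NULL, 'Smith');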

Database Normalization
Database normalization seeks to make the data in a database table logically concise, organized, and consistent. Normalization removes redundant data and improves the integrity and availability of the database. Normalization has three basic rules, called normal forms (see http://www.informit.com/articles/article.aspx?p=30646 for more information):
• First Normal Form (1NF): Divide data into tables.
• Second Normal Form (2NF): Move data that is partially dependent on the primary key to another table. The HR database is an example of 2NF.
• Third Normal Form (3NF): Remove data that is not dependent on the primary key.
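As a rough illustration using hypothetical tables, normalization moves data that does not depend on the employee primary key, such as department details, into a table of its own and links it back with a foreign key:
  -- Unnormalized: department details repeat in every employee row
  --   employee(employee_id, last_name, department_name, department_location)
  -- Normalized: department details depend only on the department key
  CREATE TABLE department (
      department_id    INTEGER PRIMARY KEY,
      department_name  VARCHAR(50),
      location         VARCHAR(50)
  );
  CREATE TABLE employee_dept (
      employee_id    INTEGER PRIMARY KEY,
      last_name      VARCHAR(50),
      department_id  INTEGER REFERENCES department (department_id)
  );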


Database Views
Database tables may be queried; the results of a query are called a database view. Views may be used to provide a constrained user interface: for example, non-management employees can be shown their individual records only via database views.
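A minimal sketch of such a constrained view, assuming the hypothetical employee table above also carries a login_name column matching each user’s database account and that a staff role exists:
  CREATE VIEW my_record AS
      SELECT employee_id, first_name, last_name, title
        FROM employee
       WHERE login_name = CURRENT_USER;  -- each user sees only their own row
  GRANT SELECT ON my_record TO staff;    -- staff query the view, not the base table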

The Data Dictionary
The data dictionary contains a description of the database tables. This is called metadata: data about data. The data dictionary contains database view information, information about authorized database administrators, user accounts including their names and privileges, and auditing information, among others. A critical data dictionary component is the database schema: it describes the attributes and values of the database tables.
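On systems that implement the standard INFORMATION_SCHEMA (PostgreSQL, MySQL, and Microsoft SQL Server, among others), part of this metadata can be queried directly; for example:
  SELECT table_name, column_name, data_type   -- schema metadata: attributes and their types
    FROM information_schema.columns
   WHERE table_name = 'employee';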

Database Query Languages
Database query languages allow the creation of database tables, read/write access to those tables, and many other functions. Database query languages have at least two subsets of commands: Data Definition Language (DDL) and Data Manipulation Language (DML). DDL is used to create, modify, and delete tables; DML is used to query and update the data stored in the tables.
The most popular relational database query language is SQL (Structured Query Language), created by IBM in 1974. Many SQL dialects and implementations exist, including MySQL, PostgreSQL, PL/SQL (Procedural Language/SQL, used by Oracle), and T-SQL (Transact-SQL, used by Microsoft SQL Server), most of which are based on ANSI standard SQL.
Common SQL commands include:
• CREATE: create a table
• SELECT: select a record
• DELETE: delete a record (or a whole table)
• INSERT: insert a record
• UPDATE: change a record
Tables are created with the CREATE command, which uses Data Definition Language to describe the format of the table that is being created. An example of a Data Manipulation Language command is SELECT, which is used to search and choose data from a table.
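A brief sketch of these commands run against the hypothetical employee table defined earlier:
  INSERT INTO employee (employee_id, first_name, last_name, title)
       VALUES (1001, 'Alice', 'Nguyen', 'Analyst');                        -- insert a record
  SELECT last_name, title FROM employee WHERE employee_id = 1001;          -- select a record
  UPDATE employee SET title = 'Senior Analyst' WHERE employee_id = 1001;   -- change a record
  DELETE FROM employee WHERE employee_id = 1001;                           -- delete a record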

Hierarchical Databases
Hierarchical databases form a tree: the global Domain Name System (DNS) servers form a global tree. The root name servers serve the “root zone” at the base of the tree; individual DNS entries form the leaves. www.syngress.com points to the syngress.com DNS database, which is part of the dot-com (.com) top-level domain (TLD), which is part of the global DNS (root zone). From the root, you may go back down another branch, down to the dot-gov (.gov) TLD, to the nist.gov (National Institute of Standards and Technology) domain, to www.nist.gov.
A special form of hierarchical database is the network model (referring to networks of people, not data networks): this allows branches of a hierarchical database to have two parents (two connections back to the root). Imagine an organization’s org chart is stored in a database that forms a tree, with the CEO as the root of the hierarchy. In this company, the physical security staff reports to both facilities (for facility issues) and to IT (for data center physical security). The network model allows the physical security staff to have “two bosses” in the hierarchical database: reporting through an IT manager and a facilities manager.
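One way to sketch the difference in relational terms (the tables are hypothetical): a strict hierarchy records a single parent per node, while the network model adds a link table so a node such as physical security can report to two parents:
  -- Strict hierarchy: each organizational unit has at most one parent
  CREATE TABLE org_unit (
      unit_id    INTEGER PRIMARY KEY,
      unit_name  VARCHAR(50),
      parent_id  INTEGER REFERENCES org_unit (unit_id)  -- NULL for the root (the CEO)
  );
  -- Network model: a link table allows a unit to have multiple parents
  CREATE TABLE reports_to (
      unit_id    INTEGER REFERENCES org_unit (unit_id),
      parent_id  INTEGER REFERENCES org_unit (unit_id),
      PRIMARY KEY (unit_id, parent_id)
  );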

Object-Oriented Databases
While databases traditionally contain just (passive) data, object-oriented databases combine data with functions (code) in an object-oriented framework. Object-Oriented Programming (OOP) is used to manipulate the objects (and their data), managed by an Object Database Management System (ODBMS).


DATABASE INTEGRITY
In addition to the previously discussed relational database integrity issues of semantic, referential, and entity integrity, databases must also ensure data integrity: the integrity of the entries in the database tables. This treats integrity as a more general issue: mitigating unauthorized modifications of data. The primary challenge associated with data integrity within a database is simultaneous attempted modifications of data. A database server typically runs multiple threads (lightweight processes), each capable of altering data. What happens if two threads attempt to alter the same record?
The DBMS may attempt to commit updates: make the pending changes permanent. If the commit is unsuccessful, the DBMS can roll back (also called abort) and restore from a savepoint (a clean snapshot of the database tables).
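A minimal sketch of this flow in standard SQL (the savepoint name and update are illustrative):
  START TRANSACTION;
  SAVEPOINT before_change;                  -- clean point to return to
  UPDATE employee SET title = 'Manager' WHERE employee_id = 1001;
  -- If a problem is detected, undo back to the savepoint...
  ROLLBACK TO SAVEPOINT before_change;
  -- ...otherwise make the pending changes permanent
  COMMIT;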
A database journal is a log of all database transactions. Should a database become corrupted, the database can be reverted to a back-up copy, and then subsequent transactions can be “replayed” from the journal, restoring database integrity.


DATABASE REPLICATION AND SHADOWING
Databases may be highly available (HA), replicated across multiple servers containing multiple copies of the tables. Integrity is the primary concern with replicated databases: if a record is updated on one server, it must be simultaneously updated on all servers. Also, what happens if two processes attempt to update the same tuple simultaneously on two different servers? They cannot both be successful; this would violate the integrity of the tuple.
Database replication mirrors a live database, allowing simultaneous reads and writes to multiple replicated databases by clients. Replicated databases pose additional integrity challenges. A two-phase (or multiphase) commit can be used to assure integrity: before committing, the DBMS requests a vote. If the DBMSs on each server agree to commit, the changes are made permanent. If any DBMS disagrees, the vote fails, and the changes are not committed (not made permanent). A shadow database is similar to a replicated database, with one key difference: a shadow database mirrors all changes made to a primary database, but clients do not access the shadow. Unlike replicated databases, the shadow database is one-way (data flows from primary to shadow): it serves as a live data backup of the primary.
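As one concrete, PostgreSQL-specific sketch of the voting idea (the transaction identifier is arbitrary, and max_prepared_transactions must be greater than zero), each server first prepares the transaction; the coordinator makes the change permanent only if every server prepared successfully:
  BEGIN;
  UPDATE employee SET title = 'Director' WHERE employee_id = 1001;
  PREPARE TRANSACTION 'txn_promote_1001';   -- phase one: this server votes by preparing
  -- Phase two, issued once every participating server has prepared successfully:
  COMMIT PREPARED 'txn_promote_1001';
  -- If any server failed to prepare, the coordinator aborts everywhere instead:
  -- ROLLBACK PREPARED 'txn_promote_1001';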

DATA WAREHOUSING AND DATA MINING
As the name implies, a data warehouse is a large collection of data. Modern data warehouses may store many terabytes (1,000 gigabytes each) or even petabytes (1,000 terabytes each) of data. This requires large, scalable storage solutions. The storage must be high performance and allow analysis and searches of the data.
Once data is collected in a warehouse, data mining is used to search it for patterns. Commonly sought patterns include signs of fraud. Credit card companies manage some of the world’s largest data warehouses, tracking billions of transactions per year. Fraudulent transactions, which lead to millions of dollars in lost revenue, are a primary concern of credit card companies. No human could possibly monitor all of those transactions, so the credit card companies use data mining to separate the signal from the noise. A common data mining fraud rule monitors multiple purchases on one card in different states or countries in a short period of time. A violation record can be produced when this occurs, leading to suspension of the card or a phone call to the card owner’s home.
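A simplified sketch of such a rule as a SQL query, assuming a hypothetical transactions table with card_number, purchase_country, and purchase_time columns:
  -- Flag any card used in more than one country within the past hour
  SELECT card_number,
         COUNT(DISTINCT purchase_country) AS countries_used
    FROM transactions
   WHERE purchase_time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
   GROUP BY card_number
  HAVING COUNT(DISTINCT purchase_country) > 1;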