Introduction to Big Data
Have you ever stopped to think about the amount and variety of data we generate and store each day? Banks, airlines, telecom operators, search services, online retailers and social networks are just a few examples of companies that deal daily with large volumes of information. The issue is that merely having data is not enough: it is important to know the data and use it. This is where the concept of Big Data comes into the scene.
In this article, you will see what Big Data is, understand why the term has become so present in the vocabulary of Information Technology environments, and learn why the concept matters to the daily life of businesses.
The Concept of Big Data
To start, we can define Big Data as data sets so extremely large that they require tools specially prepared to handle huge volumes, so that any and all information in them can be found, analyzed and harnessed in a timely manner.
It is not difficult to understand this scenario: we exchange millions of emails per day; thousands of banking transactions happen around the world every second; sophisticated solutions manage the supply chains of several factories right now; operators record all calls and traffic data from the growing number of mobile lines worldwide; ERP systems coordinate the departments of numerous companies. Examples abound – if asked, you could surely point out others effortlessly.
Information is power. If a company knows how to use the data at hand, it can learn how to improve a product, how to create a more efficient marketing strategy, how to cut spending, how to produce more while avoiding wasted resources, how to overcome a competitor, how to serve a client satisfactorily, and so forth.
Note that we are talking about factors that may even be decisive for the future of a company. Yet Big Data is a relatively new name (or, at least, one that began to appear in the media only recently). But this does not mean that only in recent years have companies found the need to make better use of their large databases.
For a long time now, IT departments have included applications for Data Mining, Business Intelligence and CRM (Customer Relationship Management), for example, to deal with data analysis, decision making and other aspects related to the business.
What Big Data proposes is a comprehensive approach to the treatment of the increasingly "chaotic" aspect of data, making those applications and all others more efficient and accurate. To that end, the concept considers not only large volumes of data but also the speed of analysis, the availability of the data, and the relationships with and between the volumes.
Facebook is one of the great examples of companies benefiting from Big Data: the service's databases grow every day and are used to determine relationships, behaviors and preferences of users.
Why is Big Data so Important?
We have dealt with data since the dawn of humanity. What has changed is that, nowadays, computational advances allow us to store, organize and analyze data far more easily and far more often.
And this scenario is far from slowing down. Just imagine, for example, that many devices in our homes – refrigerators, TVs, washing machines, coffee makers, among others – will be connected to the internet in the not too distant future. This forecast is part of what is known as the Internet of Things.
If we look at what we have now, we already see a big change compared to previous decades: drawing on the internet alone, think of the amount of data generated daily on social networks; note the immense number of web sites; notice that you can shop online even from your phone, when the height of computerization for stores in the not too distant past was isolated systems for managing their physical facilities.
Current technologies have allowed us – and still allow us – to exponentially increase the amount of information in the world, and now businesses, governments and other institutions need to know how to deal with this "explosion" of data. Big Data proposes to assist in this task, since the computational tools used until now for data management are, by themselves, no longer able to do so satisfactorily.
The amount of data generated and stored daily has reached a point where centralized processing no longer makes sense for most large entities. Google, for example, has multiple data centers to handle its operations, yet they all work in an integrated manner. It is worth remarking that this "structural partitioning" is not a barrier for Big Data – in times of cloud computing, nothing could be more trivial.
Five ‘Vs’ of Big Data: Volume, Velocity, Variety, Veracity and Value
To make the idea of Big Data clearer, some experts began to summarize the subject in ways that satisfactorily describe the basis of the concept: the five 'Vs' – volume, velocity and variety, with the veracity and value factors appearing later.
The Volume aspect you already know. We are talking about really large amounts of data, which grow exponentially and which, not infrequently, are underused precisely because they exist in these conditions.
Velocity is another point you have already assimilated. For certain purposes, the treatment of data (obtaining, recording, updating and so on) must be done in a timely manner – often in real time. If the size of the database is a limiting factor, the business can be harmed: imagine, for example, the disruption a credit card operator would cause – and suffer – if it took hours to approve a client's transaction because its security system could not quickly analyze all the data that might indicate fraud.
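To make the velocity requirement concrete, here is a minimal sketch of a rule-based transaction screen that must answer in milliseconds. The rules, thresholds and field names are invented for illustration; they are not taken from any real card network.

```python
from datetime import datetime, timedelta

def screen_transaction(tx, recent_txs, limit=5000.0, window_minutes=10):
    """Flag a transaction that is unusually large or part of a rapid burst.

    Hypothetical rules for illustration only: a real fraud system would
    score many more signals, and would have to do so in real time.
    """
    if tx["amount"] > limit:
        return "review"  # unusually large purchase
    window_start = tx["time"] - timedelta(minutes=window_minutes)
    burst = [t for t in recent_txs if t["time"] >= window_start]
    if len(burst) >= 3:
        return "review"  # several charges within a short window
    return "approve"

now = datetime(2024, 1, 1, 12, 0)
history = [{"amount": 40.0, "time": now - timedelta(minutes=2)}]
print(screen_transaction({"amount": 25.0, "time": now}, history))  # approve
```

Even this toy check has to read the client's recent history; at the scale of a real operator, that lookup across billions of transactions is exactly where velocity becomes a Big Data problem.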
Variety is another important aspect. The volumes of data we have today exist thanks to the diversity of information. We have data in structured form, i.e. stored in databases such as PostgreSQL and Oracle, and unstructured data coming from numerous sources, such as documents, images, audio, video and so on. You must know how to treat the variety as part of a whole – one type of data can be useless if not combined with others.
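The following sketch illustrates that combination of types: a structured record (as it might come from a relational table) is enriched with a fact extracted from an unstructured support message. The record fields, message and order-id pattern are all hypothetical.

```python
import re

# Structured data: a record as it might come from a relational table.
order = {"order_id": "A-1001", "customer": "Alice", "total": 250.0}

# Unstructured data: a free-text support message about the same order.
message = "Order A-1001 arrived damaged, please send a replacement."

# Pull the order id out of the free text to join the two sources.
match = re.search(r"\bA-\d+\b", message)
if match and match.group() == order["order_id"]:
    enriched = {**order, "complaint": True}
    print(enriched["customer"], "reported a problem with", enriched["order_id"])
```

Neither source alone says that a paying customer is unhappy; only the combination does, which is the point of treating variety as part of a whole.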
The standpoint of Veracity can also be considered, because there is not much point in dealing with the combination "volume + velocity + variety" if the data is unreliable. There must be processes that ensure, as far as possible, the consistency of the data. Returning to the credit card operator example, imagine the problem the company would have if its system blocked a genuine transaction after analyzing data inconsistent with reality.
Information is power, information is wealth. The combination "volume + velocity + variety + veracity", along with every other aspect that characterizes a Big Data solution, will prove unviable if the result does not bring significant benefits that offset the investment. This is the Value aspect.
It is clear that these five aspects need not be taken as a perfect definition. Some believe, for example, that the combination "volume + velocity + variety" is sufficient to convey an acceptable notion of Big Data. Under this approach, the aspects of veracity and value would be unnecessary, because they are already implicit in the business – any serious entity knows that it needs consistent data, and no entity makes decisions and investments without an expectation of return.
Emphasizing these two points may indeed be unnecessary, since it amounts to referencing what seems obvious. On the other hand, the consideration may be relevant because it reinforces the care these aspects require: a company may be analyzing social networks to assess the image customers have of its products, but is this information reliable enough to dispense with more discerning procedures? Is a deeper study not necessary to reduce the risk of an investment before committing to it?
In any case, the first three 'Vs' – volume, velocity and variety – may not offer the best definition of the concept, but they are not far from doing so. It should not be understood that Big Data is only huge amounts of data: a volume may not be very large and still fit the context because of the velocity and variety factors.
Big Data Solutions
In addition to handling extremely large volumes of data of various types, Big Data solutions also need to deal with elasticity in processing and distribution, i.e. support applications whose data volumes grow substantially in a short time.
The problem is that "traditional" databases, especially those that follow the relational model, such as MySQL, PostgreSQL and Oracle, are not well suited to these requirements, as they are less flexible.
This is because relational databases are usually based on four properties that make their adoption safe and efficient, which is why solutions of this type are so popular: Atomicity, Consistency, Isolation and Durability. This combination is known as ACID. Here is a brief description of each:
- Atomicity: every transaction must be atomic, i.e. can only be deemed effective if implemented in full;
- Consistency: all rules applied to the database must be followed;
- Isolation: no transaction can interfere with another that is in progress at the same time;
- Durability: once a transaction is completed, the resulting data cannot be lost.
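Atomicity in particular can be seen in a few lines of code. The sketch below uses Python's built-in sqlite3 module (SQLite is an ACID-compliant relational database); the account table and the failing transfer are invented for illustration.

```python
import sqlite3

# Two accounts; the CHECK constraint forbids a negative balance.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY,"
    " balance REAL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100.0), ("bob", 50.0)])
conn.commit()

try:
    with conn:  # one transaction: both updates apply, or neither does
        conn.execute("UPDATE accounts SET balance = balance - 200"
                     " WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 200"
                     " WHERE name = 'bob'")
except sqlite3.IntegrityError:
    pass  # the CHECK fired, so the whole transaction was rolled back

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100.0, 'bob': 50.0} - unchanged
```

Because the first update would drive alice's balance negative, the constraint aborts the transaction and neither account changes: the transfer is atomic.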
The problem is that this set of properties is too restrictive for a Big Data solution. Elasticity, for example, can be frustrated by consistency and atomicity. This is where the concept of NoSQL comes into the picture.
NoSQL refers to database solutions that enable storage in different forms, not limited to the traditional relational model. Databases of this type are more flexible, including compatibility with a group of assumptions that "compete" with the ACID properties: BASE (Basically Available, Soft state, Eventual consistency).
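The "Eventual consistency" in BASE can be pictured with a toy model: two replicas accept writes independently, may briefly disagree, and later converge by keeping the newest value. This is an illustration of the idea only, not how any specific NoSQL database implements replication.

```python
class Replica:
    """Toy key/value replica using last-write-wins timestamps."""

    def __init__(self):
        self.data = {}  # key -> (timestamp, value)

    def write(self, key, value, ts):
        self.data[key] = (ts, value)

    def merge(self, other):
        # Anti-entropy: adopt any entry the other replica has newer.
        for key, (ts, value) in other.data.items():
            if key not in self.data or self.data[key][0] < ts:
                self.data[key] = (ts, value)

a, b = Replica(), Replica()
a.write("user:1", "Alice", ts=1)
b.write("user:1", "Alicia", ts=2)  # concurrent update on another node
print(a.data["user:1"][1])         # stale read for a while: Alice
a.merge(b)
b.merge(a)
print(a.data["user:1"][1], b.data["user:1"][1])  # both converge: Alicia Alicia
```

The system stays available for reads and writes the whole time; the trade-off is that a client may briefly see the stale value before the replicas exchange state.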
This is not to say that relational databases have become obsolete – they remain, and will continue to be, useful for a number of applications. What happens is that, generally, the larger a relational database becomes, the more costly and laborious it is to manage: you need to optimize it, add new servers, employ more specialists in its maintenance, and so on.
As a rule, scaling (making bigger) a NoSQL database is easier and less costly. This is possible because, in addition to having more flexible properties, databases of this type are already optimized for parallel processing, global distribution (multiple data centers), immediate increases in capacity, and so on.
In addition, there is more than one category of NoSQL database, so solutions of this type can meet the wide variety of data that exists, both structured and unstructured: document-oriented databases, key/value databases, graph databases, and so on.
Examples of NoSQL databases are Cassandra, MongoDB, HBase, CouchDB, Redis and Riak. When it comes to Big Data, however, having only a database does not suffice. You also need tools that allow the processing of these volumes. At this point, Hadoop is by far the main reference.
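The processing model Hadoop popularized, MapReduce, can be sketched in plain Python on a word count, the classic introductory example. Hadoop runs these same three steps across a cluster of machines; here everything happens in one process, and the sample documents are invented.

```python
from collections import defaultdict

documents = ["big data needs big tools", "data about data"]

# Map: each document is turned into (word, 1) pairs.
pairs = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the emitted values by key.
groups = defaultdict(list)
for word, count in pairs:
    groups[word].append(count)

# Reduce: combine each group into a final result.
counts = {word: sum(values) for word, values in groups.items()}
print(counts["data"])  # 3
```

The appeal of the model is that map and reduce touch only one document or one group at a time, so a framework like Hadoop can run them in parallel over data spread across many nodes.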
We cannot consider Big Data solutions a perfect computational arsenal: systems of this type are complex, still unknown to many managers and professionals, and their very definition is still open for discussion.
The fact is that the idea of Big Data reflects a real scenario: increasingly, there are huge volumes of data and, therefore, a need for an approach capable of exploiting them to the fullest. Just to give an idea of the challenge, IBM announced in late 2012 that, by its estimates, 90% of the data available in the world had been generated in the previous two years alone.