Big Data Cloud Solutions – I began my career as a database developer and sysadmin at Oracle in 1998. Over the past 20 years, it has been amazing to see how IT has evolved to handle ever greater amounts of data using technologies such as relational OLTP (Online Transaction Processing) databases, data warehouses, ETL (Extraction, Transformation and Loading), OLAP (Online Analytical Processing) reporting, big data, real-time artificial intelligence, cloud and IoT. All of these technologies have ridden a rapid increase in computing power, especially in processors, memory, storage capacity and network speed. The goal of this article is first to summarize the principles behind handling large amounts of data, and second to share a thought process that I hope will help you better understand new technologies in the data space and find the right architecture to ride the waves of current and future technology.
In an organization, data usually goes through two stages: data processing and data use. When data of any type enters an organization (in most cases from multiple sources), it is unlikely to be clean or in a form that can be reported on or analyzed directly by potential business users inside or outside the organization. Data processing is therefore required first, and usually involves cleaning, standardizing, transforming and blending the data. The final data is then presented in the data access layer, ready to be reported on and analyzed. Data processing is also sometimes called data preparation, data integration, or ETL; of these, ETL is probably the most popular name.
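The cleaning, standardizing and transforming steps above can be sketched as a minimal pipeline. This is an illustrative sketch only; the record shape and field names are assumptions, not part of any particular ETL tool:

```python
# Minimal ETL sketch: clean, standardize, then emit records ready for
# the data access layer. The "name" field is an illustrative assumption.

def clean(record):
    """Drop records missing required fields."""
    return record if record.get("name") else None

def standardize(record):
    """Normalize whitespace and casing so sources blend consistently."""
    record["name"] = record["name"].strip().title()
    return record

def etl(records):
    out = []
    for r in records:
        r = clean(r)
        if r is None:
            continue          # dirty record: excluded from the final data
        out.append(standardize(r))
    return out

raw = [{"name": "  alice smith "}, {"name": ""}, {"name": "BOB JONES"}]
print(etl(raw))  # → [{'name': 'Alice Smith'}, {'name': 'Bob Jones'}]
```

Real pipelines add many more steps (type conversion, deduplication, lookups against reference data), but each still fits this read-transform-write shape.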
Data processing and data access have different objectives and have therefore been implemented with different technologies. Big data processing emphasizes "scalability" from the start, meaning that as the amount of data grows, processing time must remain within expectations on the available hardware. Total processing time can range from minutes to hours or days, depending on the data volume and the complexity of the processing logic. Data access, on the other hand, emphasizes "fast" response times on the order of seconds. At a high level, scalable data processing has been achieved primarily through parallel processing, while fast data access is achieved by optimizing data structures for the access patterns and by increasing the memory available on the servers.
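A toy example of "optimizing the data structure for the access pattern": a linear scan touches every record, while an index built once answers repeated point lookups in constant time. The record shape here is an illustrative assumption:

```python
# One-off processing can afford a full scan; repeated access wants an
# index tuned to the lookup pattern. Records are illustrative.

records = [{"id": i, "value": i * i} for i in range(100_000)]

def scan_lookup(records, key):
    for r in records:                  # O(n) per query: fine once, bad repeatedly
        if r["id"] == key:
            return r
    return None

index = {r["id"]: r for r in records}  # built once, optimized for point lookups

def indexed_lookup(index, key):
    return index.get(key)              # O(1) per query

print(scan_lookup(records, 99_999) == indexed_lookup(index, 99_999))  # → True
```

Database indexes, columnar layouts and in-memory caches are all elaborations of this same trade: spend work up front to make the expected queries fast.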
To clean, standardize and transform data from different sources, data processing must touch every incoming record; once a record has been cleaned and transformed, the job is done with it. This is fundamentally different from data access, which involves retrieving and reading the same data repeatedly for different users and/or applications. When the amount of data is small, the speed of data processing is less demanding than that of data access, so processing usually happens in the same database where the final data resides. As data volumes grew, it became clear that data processing had to move outside the database to avoid the extra cost and the limitations of a database system that was clearly not designed for big data processing. That is when ETL, and later Hadoop, began to play an important role in the era of data warehousing and big data.
The challenge of big data processing is that the amount of data to be processed is always on the order of what the hard drives can hold, but far more than the compute memory available at any given time. The basic method for efficient data processing is to divide the data into smaller pieces and process them in parallel. In other words, scalability is achieved first by designing programs for parallel processing, so that as the amount of data grows, the number of parallel processes grows while each process continues to handle the same amount of data as before; and second, by adding more servers with more processors, memory and disks as the number of parallel processes increases.
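The divide-and-process-in-parallel idea can be sketched with Python's standard library. This is a single-machine illustration only (real systems distribute chunks across servers); the chunk size and the per-record transform are arbitrary assumptions:

```python
# Split data into fixed-size chunks and process them in parallel: more
# data means more chunks (and workers), not more work per worker.
from concurrent.futures import ProcessPoolExecutor

CHUNK_SIZE = 4  # each worker handles the same amount of data as before

def process_chunk(chunk):
    # stand-in for cleaning/transforming each record exactly once
    return [x * 2 for x in chunk]

def parallel_process(data):
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    with ProcessPoolExecutor() as pool:
        results = pool.map(process_chunk, chunks)
    return [x for chunk in results for x in chunk]

if __name__ == "__main__":
    print(parallel_process(list(range(10))))
```

Doubling the data here doubles the number of chunks; with enough CPUs, elapsed time stays roughly flat, which is the scalability property the paragraph describes.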
Parallel big data processing was initially implemented with data partitioning technology in database systems and ETL tools. When a dataset is logically partitioned, each partition can be processed in parallel. HDFS (the Hadoop Distributed File System) applies the same principle in the most scalable way: it divides the data into fixed-size blocks, distributes the blocks across the server nodes, and records their locations in the metadata repository on the so-called NameNode. When a data process starts, the number of tasks is determined by the number of data blocks and by the available resources (such as processors and memory) on each server node. This means that HDFS allows massive parallel processing as long as enough CPUs and memory are available across the servers.
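A back-of-envelope sketch shows how block partitioning sets the degree of parallelism. The 128 MB block size matches the common HDFS default; the file size and core count below are purely illustrative:

```python
# How many parallel tasks does a file yield under HDFS-style blocking?
import math

BLOCK_SIZE_MB = 128  # common HDFS default block size

def num_blocks(file_size_mb):
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)

def parallel_tasks(file_size_mb, cluster_cores):
    # roughly one task per block, capped by available CPU cores
    return min(num_blocks(file_size_mb), cluster_cores)

print(num_blocks(10_000))          # a ~10 GB file → 79 blocks
print(parallel_tasks(10_000, 64))  # only 64 cores available → 64 tasks
```

Adding server nodes raises the core cap, which is exactly why scaling out the cluster scales up the processing.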
Currently, Spark has become one of the most popular fast engines for large-scale in-memory computing. Does this make sense? Although memory has indeed become cheaper, it is still more expensive than disk. In the big data space, the amount of data to be processed is always much greater than the amount of available memory. So how does Spark address this? First, Spark uses the total amount of memory across a distributed environment with multiple data nodes. Even so, memory remains insufficient, and it can become expensive if an organization tries to fit all of its big data into a Spark cluster. So let's consider what kind of processing Spark is suited for. Data processing always begins with reading data from disk into memory and ends with writing the results back to disk. If each record only needs to be processed once before being written out, as in typical batch processing, Spark offers no advantage over Hadoop. On the other hand, Spark can keep data in memory across multiple transformation steps, while Hadoop cannot. This means Spark shines when the same data is processed iteratively many times, which is exactly what analytics and machine learning need. Now consider that tens or hundreds of such analysis processes may run at the same time: how do you scale the processing conveniently? Clearly, relying solely on in-memory processing is not the complete answer, and distributed big data storage such as Hadoop remains a necessary part of a big data solution that integrates Spark computing.
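The iterative-workload advantage can be made concrete with a small pure-Python simulation (not Spark itself): counting how often the dataset is re-read from storage with and without an in-memory cache. The read counter and dataset are illustrative:

```python
# Why caching pays off for iterative workloads: re-reading from "disk"
# every iteration vs. reading once and iterating in memory.

class DiskSource:
    """Counts how many times the dataset is (re)read from storage."""
    def __init__(self):
        self.reads = 0
    def read(self):
        self.reads += 1
        return list(range(1000))

def without_cache(src, iterations=10):
    # batch style: each pass re-reads its input from disk
    return [sum(src.read()) for _ in range(iterations)]

def with_cache(src, iterations=10):
    data = src.read()  # Spark-style: read once, keep in memory across steps
    return [sum(data) for _ in range(iterations)]

a, b = DiskSource(), DiskSource()
without_cache(a)
with_cache(b)
print(a.reads, b.reads)  # → 10 1
```

Ten iterations cost ten disk reads without caching but only one with it; multiply that by real dataset sizes and iteration counts and the in-memory design earns its keep.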
Another hot topic in the computing space is stream processing. It offers a great advantage in reducing processing latency, because at any given time it only has to process the small amount of data that has just arrived. However, it is not as versatile as batch processing in two respects: first, the input data must arrive in "stream" form; and second, some processing logic, such as aggregation over time windows, still has to be completed in a batch later.
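The split between cheap per-event logic and deferred windowed aggregation can be sketched as follows. The event shape (timestamp in seconds, value) and the 60-second window are illustrative assumptions:

```python
# Stream sketch: per-event work happens as each event arrives; the
# time-window aggregation is completed afterwards over buffered data.
from collections import defaultdict

def process_stream(events, window_seconds=60):
    per_event = []               # handled immediately, event by event
    windows = defaultdict(list)  # buffered for later aggregation
    for ts, value in events:
        per_event.append(value * 2)            # cheap per-record logic
        windows[ts // window_seconds].append(value)
    # the windowed aggregation is finished "in a batch later"
    return per_event, {w: sum(vs) for w, vs in windows.items()}

events = [(0, 1), (30, 2), (65, 3), (90, 4)]
print(process_stream(events))  # → ([2, 4, 6, 8], {0: 3, 1: 7})
```

Frameworks such as Spark Structured Streaming or Flink implement the same pattern with proper windowing, watermarks and fault tolerance; this sketch only shows the shape of the trade-off.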
Finally, cloud solutions offer the ability to scale a distributed computing system dynamically based on the volume of data and, with it, the number of parallel processes. This is difficult to achieve on premises, where new servers must be planned, budgeted and purchased in advance. If capacity is not planned well, big data processing is either constrained by the available hardware or forces additional purchases. The flexibility of cloud infrastructure therefore provides greater assurance of achieving the best scalability in a more cost-effective way.
With the above principles in mind, several milestones over the past two decades reflect how ever-increasing amounts of data can be accessed while still returning the requested data in seconds:
The table below shows some popular examples of each type of database; it is not meant to be an exhaustive list. Note that a database can combine more than one technology: for example, Redis is both a NoSQL and an in-memory database. Additionally, data retrieval from data warehouses and columnar storage uses parallel processing wherever possible. Because there can be many different database choices depending on the data content, data structure and the retrieval patterns of users and/or applications, data access is an area that organizations must evolve rapidly and continuously. It is also common for different types of databases or tools to coexist for different purposes.
As we can see, the big difference between data processing and data access is that data access is ultimately driven by the needs of customers and the business, and choosing the right technology guides the development of future products and improves the user experience. Data, on the other hand, is a company's primary asset, and processing it at scale into high-quality data is essential for a company to grow with its data. Many companies struggle as their data processing systems fall behind the growing amount of data.