What is Big Data?
Definition: Three characteristics mainly define Big Data.
- The data is very large in volume.
- The data grows constantly.
- The data is complex and difficult to process. It is so large and complex that traditional data management tools cannot process it effectively.
Some examples of Big Data
- Population data of any country
- Stock Exchange data
- Data generated by social media platforms every day
Types of Big Data
Structured:
- Data that is in a well-defined format, understandable by computers and humans, properly ordered, and easy to access is called structured data.
- Structured data is well organized, typically in a tabular format, so it can be easily accessed and processed.
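As an illustration of structured data, here is a minimal sketch that reads a small table with Python's standard `csv` module. The sample rows, column names, and the `load_rows` helper are hypothetical and were made up for this example.

```python
import csv
import io

# Hypothetical sample: every row has the same columns in the same order.
CSV_TEXT = """id,name,country,age
1,Asha,India,34
2,Bruno,Brazil,41
3,Chen,China,29
"""

def load_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))

rows = load_rows(CSV_TEXT)

# Because the schema is fixed, a field can be addressed by name on every row.
ages = [int(row["age"]) for row in rows]
average_age = sum(ages) / len(ages)
print(average_age)
```

Because every row follows the same schema, a simple loop can process the whole table without checking which fields exist.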
Unstructured:
- You have unstructured data when you have different varieties of data types such as text, images, video, and audio; when your data keeps growing inconsistently; when you face challenges in deriving proper value out of your data; or when you have issues in processing your data.
- Examples:
- In a Microsoft Word document, you write text and embed images, videos, links, and so on. If a system fetches this data, it is very difficult for it to separate the different parts and process them.
- Weather forecast reports
- Traffic data
- Military movements
Semi-Structured:
- This is data that follows neither a tabular format nor the structure of a relational database, but that still contains tags or markers identifying the structure and representation of the data.
- Examples:
- XML and other markup languages
- Emails
- JSON
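To make the idea of semi-structured data concrete, here is a minimal sketch using Python's standard `json` module. The records, field names, and the `city_of` helper are hypothetical. The keys act as the tags that describe the data, but not every record carries the same fields.

```python
import json

# Hypothetical records: keys label the data, but the fields vary per record.
JSON_TEXT = """
[
  {"id": 1, "name": "Asha", "emails": ["asha@example.com"]},
  {"id": 2, "name": "Bruno", "address": {"city": "Recife"}},
  {"id": 3, "name": "Chen"}
]
"""

def city_of(record):
    """Return the nested city if the record has one, else None."""
    return record.get("address", {}).get("city")

records = json.loads(JSON_TEXT)

# Code must tolerate missing or nested fields instead of relying on fixed columns.
cities = [city_of(r) for r in records]
all_keys = sorted({key for r in records for key in r})
print(cities)
print(all_keys)
```

Unlike a fixed table, the code has to check whether a field exists before using it, which is the main practical difference from structured data.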
In Azure, we have some services that can be of great help in processing Big Data.
In this chapter, we will focus on the key areas only. For explanations of individual services, refer to the Microsoft documentation for that service.
Azure Data Lake Analytics features
Instead of deploying and configuring hardware, you write queries and extract only the valuable information.
- Run massive parallel data processing
- Develop programs in U-SQL, R, Python, and .NET over petabytes of data.
- No need to manage infrastructure
- Process data on demand
- Debug and optimize your big data with ease
- Enterprise-grade security, support, and auditing
- Scale instantly
- Pay only for what you use. Per-second billing.
- No upfront cost. No termination fees.
Azure HDInsight features
- Azure HDInsight is a customizable, enterprise-grade service for open-source analytics.
- You can process a massive amount of data with the help of an open-source platform and the global reach of Azure.
- You can run popular open-source frameworks, including Apache Hadoop, Spark, Hive, Kafka, and more.
- Pay only for what you use.
- Autoscaling
- Enterprise-grade security helps protect your data
- Build your data lake through Azure data storage and process the data using analytics and storage services like Azure Synapse Analytics, Azure Cosmos DB, Azure Data Lake Storage, Azure Blob Storage, Azure Event Hubs, and Azure Data Factory.
Azure Databricks
- Databricks is based on Apache Spark, which is open-source.
- A job can run on multiple computers at the same time, with each computer processing a part of the same data set in parallel.
- Autoscaling
- Use your preferred language, including Python, Scala, R, Spark SQL, and .NET
- Enterprise-grade security helps protect your data
- Native integration with Azure services such as Azure Data Factory, Azure Data Lake Storage, Azure Machine Learning, and Power BI
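The idea of processing one data set on several machines at once can be sketched in plain Python. This is not the Spark or Databricks API. It is only an illustration of the split, process, and combine pattern, with threads standing in for cluster nodes. The function names and the sample lines are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def partition(items, n_parts):
    """Split items into n_parts slices (some may be empty for small inputs)."""
    return [items[i::n_parts] for i in range(n_parts)]

def count_words(lines):
    """Map step: count the words within one partition."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

def parallel_word_count(lines, n_workers=4):
    """Count words by processing partitions concurrently, then merging."""
    parts = partition(lines, n_workers)
    # Assumption: each worker thread plays the role of one cluster node.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(count_words, parts))
    # Reduce step: merge the partial counts into one result.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

LINES = ["big data is big", "data keeps growing", "spark splits data"]
result = parallel_word_count(LINES, n_workers=2)
print(result.most_common(2))
```

Each worker sees only its own slice of the data, and the partial results are merged at the end. A real cluster framework applies the same pattern across machines and also handles scheduling, data movement, and failures.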
Azure Synapse Analytics
- This is an analytics service that brings together data integration, data warehouse, and big data analytics.
- You can query the data using either serverless or dedicated resources.
- Deliver insights from all your data, across data warehouses.
- Reduce project development time with a unified experience for developing end-to-end analytics solutions.
- Advanced security and privacy features, such as column-level and row-level security and dynamic data masking.