Comprehensive Guide On TCP/IP Model (updated November 2023, Minhminhbmm.com)
What is TCP/IP Model?
Understanding TCP/IP Model
The United States Department of Defense initially developed the Internet Protocol Suite during the 1970s. It connects heterogeneous systems and contains a popular set of communication protocols. TCP and IP are its most widely used protocols, and its layering differs from that of the OSI model.
How does TCP/IP work?
Below are a few points explaining the working of TCP/IP:
1. Network Access Layer
Here, the OSI model’s physical layer and data link layer combine to form the network access layer. It allows the physical transmission of data through the protocols and hardware elements of the layer. ARP is often considered to operate at layer 3, while its messages are encapsulated by layer 2 protocols.
2. Internet Layer
IP: Stands for “Internet Protocol,” and it is in charge of packet delivery. Packets are routed between the source and the destination using the IP addresses in the packet headers. IPv4 and IPv6 are the versions in use today. Most websites are still served over IPv4, while IPv6 adoption is growing steadily.
ICMP: Stands for “Internet Control Message Protocol.” It reports information about network problems to hosts, and its messages are encapsulated within IP datagrams.
ARP: Stands for “Address Resolution Protocol.” ARP determines the hardware address from the specified internet protocol address. The major classifications of ARP are: Reverse ARP, Gratuitous ARP, Proxy ARP, and Inverse ARP.
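As a small illustration of the two IP versions mentioned above, the following sketch uses Python's standard `ipaddress` module to parse an address and report its version and width. The helper name `describe` and the sample addresses (taken from the ranges reserved for documentation) are assumptions made for this example, not part of the original guide.

```python
import ipaddress


def describe(address: str) -> str:
    """Return a short description of an IPv4 or IPv6 address."""
    ip = ipaddress.ip_address(address)  # raises ValueError on malformed input
    return f"IPv{ip.version}, {ip.max_prefixlen}-bit address"


print(describe("192.0.2.10"))   # sample IPv4 address (documentation range)
print(describe("2001:db8::1"))  # sample IPv6 address (documentation range)
```

The difference in address width (32 bits versus 128 bits) is the main reason IPv6 was introduced alongside IPv4.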
3. Host-to-Host Layer
It is roughly equivalent to the OSI model’s transport layer. It shields the upper layers from the complexities of data delivery. The key protocols here are:
Transmission Control Protocol (TCP): A connection-oriented protocol that provides reliable, ordered delivery of data between hosts.
User Datagram Protocol (UDP): A connectionless protocol with very low overhead. It can be used when reliable delivery is not a major factor.
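To make "connectionless" concrete, here is a minimal sketch using Python's standard `socket` module. It sends one UDP datagram over the loopback interface without any handshake. The function name `udp_round_trip` and the use of 127.0.0.1 with an OS-chosen port are assumptions for the example.

```python
import socket


def udp_round_trip(payload: bytes) -> bytes:
    """Send one datagram to a local receiver and return what arrived."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as receiver, \
            socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
        receiver.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        receiver.settimeout(2.0)         # do not block forever if nothing arrives
        # No connect() and no handshake: the datagram is addressed and sent.
        sender.sendto(payload, receiver.getsockname())
        data, _sender_address = receiver.recvfrom(1024)
        return data


print(udp_round_trip(b"ping"))
```

On a real network, nothing in UDP itself would report a lost datagram; any retry logic has to live in the application.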
4. Process Layer
All the functions of the top three layers of the OSI model, i.e., the Application, Presentation, and Session layers, are performed here. It is responsible for controlling user interface specifications and node-to-node communication. The most commonly used protocols are: HTTP, HTTPS, FTP, TFTP, Telnet, SSH, SMTP, SNMP, NTP, DNS, DHCP, NFS, X Window, and LPD.
HTTP and HTTPS: HTTP stands for Hypertext Transfer Protocol, and HTTPS is its secure variant, which combines HTTP with SSL. These protocols manage communication between the server and the browser. HTTPS is the appropriate choice when the browser submits forms (sign in, validation, bank transactions).
SSH: Stands for Secure Shell and is very similar to Telnet. It is used for terminal emulation. The primary reason for preferring SSH is that it keeps the connection encrypted and therefore secure.
NTP: Stands for Network Time Protocol. It synchronizes the clocks of computer systems to a single standard time source. NTP plays a crucial role in bank-oriented transactions.
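The four layers described above can be pictured as successive wrapping: each layer adds its own header around what it receives from the layer above. The sketch below is a toy model only. The tags `TCP`, `IP`, and `ETH` and the bracket notation are assumptions chosen for readability and do not reflect real header formats.

```python
# Layers below the process layer, in the order data passes through them
# on the sending side. Tags are illustrative placeholders.
LAYERS = [
    ("Host-to-Host", "TCP"),
    ("Internet", "IP"),
    ("Network Access", "ETH"),
]


def encapsulate(app_data: str) -> str:
    """Wrap application data in one pseudo-header per lower layer."""
    unit = app_data
    for _layer, tag in LAYERS:
        unit = f"[{tag}|{unit}]"
    return unit


def decapsulate(frame: str) -> str:
    """Strip the pseudo-headers in reverse order on the receiving side."""
    unit = frame
    for _layer, tag in reversed(LAYERS):
        prefix = f"[{tag}|"
        if not (unit.startswith(prefix) and unit.endswith("]")):
            raise ValueError(f"expected a {tag} header")
        unit = unit[len(prefix):-1]
    return unit


frame = encapsulate("GET /index.html")
print(frame)
print(decapsulate(frame))
```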
Advantages of TCP/IP Model
It helps solve practical networking problems.
The model allows communication between heterogeneous networks.
It is an open protocol suite, which makes it available to any individual or institution.
Its scalable client-server architecture allows networks to be added without disrupting current services.
Every system on the network is assigned an IP address.
Scope of TCP/IP Model
In the communication world, the base unit is the packet. These packets are built using TCP/IP protocols. Every operating system has several unique values coded into its implementation of the TCP/IP stack. OS fingerprinting works on this basis, by studying these unique values, such as MTU and MSS. As has been said before, to identify the abnormal, one first needs to recognize what is normal.
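The relationship between the two values named above can be shown with simple arithmetic: the MSS is what remains of the MTU once the IP and TCP headers are subtracted. The sketch below assumes minimum header sizes with no options (20 bytes for IPv4, 40 bytes for IPv6, 20 bytes for TCP). The function name `mss_for_mtu` is hypothetical.

```python
IPV4_HEADER = 20  # bytes, minimum IPv4 header without options
IPV6_HEADER = 40  # bytes, fixed IPv6 header
TCP_HEADER = 20   # bytes, minimum TCP header without options


def mss_for_mtu(mtu: int, ip_header: int = IPV4_HEADER) -> int:
    """Largest TCP payload that fits in one packet of the given MTU."""
    mss = mtu - ip_header - TCP_HEADER
    if mss <= 0:
        raise ValueError("MTU is too small to carry any TCP payload")
    return mss


print(mss_for_mtu(1500))               # common Ethernet MTU over IPv4
print(mss_for_mtu(1500, IPV6_HEADER))  # same link over IPv6
```

A fingerprinting tool would compare such observed values against what each operating system is known to advertise by default.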
Thus, TCP/IP is a powerful network communication protocol suite that allows access to remote terminals and computers over the internet.
A Comprehensive Guide On Kubernetes
This article was published as a part of the Data Science Blogathon.
Image-1
Introduction
In this guide, we will dive in to learn about Kubernetes and use it to deploy and manage containers at scale.
Containers and microservice architectures are increasingly used to create modern apps. Kubernetes is open-source software that allows you to deploy and manage containers at scale. It groups containers into logical units to make your application’s management, discovery, and scaling easier.
The main goal of this guide is to provide a complete overview of the Kubernetes ecosystem while keeping it basic and straightforward. It covers Kubernetes’ core ideas before applying them to a real-world scenario.
Even if you have no prior experience with Kubernetes, this article will serve as an excellent starting point for your journey.
So, without further ado, let’s get this learning started.
Why Kubernetes?
Before we go into the technical ideas, let us start with why a developer should use Kubernetes in the first place. Here are a few reasons why developers should use Kubernetes in their projects.
Portability
When using Kubernetes, moving containerized applications from development to production appears to be an easy process. Kubernetes enables developers to orchestrate containers in various environments, including on-premises infrastructure, public and hybrid clouds.
Scalability
Kubernetes simplifies the process of defining complex containerized applications and deploying them globally across multiple clusters of servers by scaling resources based on your desired state. Kubernetes automatically checks and maintains container health when horizontally scaling applications.
Extensibility
Kubernetes has a vast and ever-expanding collection of extensions and plugins created by developers and businesses that make it simple to add unique capabilities to your clusters such as security, monitoring, or management.
Concepts
Using Kubernetes necessitates an understanding of the various abstractions it employs to represent the state of the system. That is the focus of this section. We get acquainted with the essential concepts and provide you with a clearer picture of the overall architecture.
Pods
A Pod is a group of one or more application containers that share storage, a unique cluster IP address, and instructions for running them (e.g. ports, container image, and restart and failure policies).
Pods are the foundation of the Kubernetes platform. When you create a deployment, Kubernetes creates Pods with the containers inside.
Each pod runs on the node where it is scheduled and remains there until it is terminated or deleted. If the node fails or stops, Kubernetes will automatically schedule identical Pods on the cluster’s other available Nodes.
Image-2
Node
A node is a worker machine in a Kubernetes cluster that can be virtual or physical depending on the cluster type. The master is in charge of each node. The master automatically schedules pods across all nodes in the cluster, based on their available resources and current configuration.
Each node is required to run at least two services:
Kubelet, a process that handles communication between the Kubernetes master and the node.
A container runtime, which is in charge of downloading and running container images (e.g. Docker).
Image-3
Services
A Service is an abstraction that describes a logical set of Pods and the policies for accessing them. Services allow for the loose coupling of dependent Pods.
Even though each pod has a distinct IP-Address, those addresses are not visible to the outside world. As a result, a service enables your deployment to receive traffic from external sources.
We can expose services in a variety of ways:
ClusterIP (default) – Only exposes the service on an internal IP address within the cluster.
NodePort – Uses NAT to expose the service on the same port on every node in the cluster.
LoadBalancer – Creates an external load balancer to expose the service on a specified IP address.
Image-4
Deployments
Deployments include a description of your application’s desired state. The deployment controller then works to ensure that the application’s current state matches that description.
A deployment automatically runs multiple replicas of your program and replaces any instances that fail or become unresponsive. In this fashion, deployments help ensure that your program is ready to serve user requests.
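The idea of matching the current state to a desired state can be sketched as a small reconciliation loop. This is a toy model under simplifying assumptions, not the way the Kubernetes deployment controller is implemented. The function `reconcile` and the pod names are hypothetical.

```python
from itertools import count


def reconcile(desired, pods, new_names):
    """Return the pod set after one reconciliation pass.

    pods maps a pod name to True (healthy) or False (failed).
    """
    # Failed pods are dropped; a real controller would delete them.
    current = {name: True for name, healthy in pods.items() if healthy}
    # Too few pods: start replacements until the desired count is reached.
    while len(current) < desired:
        current[next(new_names)] = True
    # Too many pods: remove the surplus.
    while len(current) > desired:
        current.popitem()
    return current


names = (f"nginx-{i}" for i in count(1))  # hypothetical pod names
state = {"nginx-a": True, "nginx-b": False, "nginx-c": True}
state = reconcile(3, state, names)  # nginx-b failed, so one pod is added
state = reconcile(5, state, names)  # the desired count was raised to five
print(sorted(state))
```

The same pass handles both failure recovery and scaling, because in each case only the gap between desired and current state matters.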
Image-5
Installation
Before we dive into building our cluster, we must first install Kubernetes on our local workstation.
Docker Desktop
If you’re using Docker Desktop on Windows or Mac, you may install Kubernetes directly from the user interface’s settings pane.
Others
If you are not using Docker Desktop, I recommend that you follow the official installation procedure for kubectl and Minikube.
Basics
Now that we’ve covered the fundamental ideas, let’s move on to the practical side of Kubernetes. This chapter will walk you through the fundamentals required to deploy apps in a cluster.
Creating cluster
When you launch Minikube, it immediately forms a cluster.
minikube start
After installation, Docker Desktop should also automatically create a cluster. You may use the following commands to see if your cluster is up and running:
# Get information about the cluster
kubectl cluster-info
# Get all nodes of the cluster
kubectl get nodes
Deploying an application:
Now that we’ve completed the installation and established our first cluster, we’re ready to deploy an application to Kubernetes.
kubectl create deployment nginx --image=nginx:latest
We use the create deployment command, passing the deployment name and the container image as inputs. This example deploys Nginx with one container and one replica.
Using the get deployments command, you may view your active deployments.
kubectl get deployments
Information about deployments
Here are a few commands you may use to learn more about your Kubernetes deployments and pods.
Obtaining all of the pods
Using the kubectl get pods command, you can get a list of all running pods:
kubectl get pods
Detailed description of a pod
Use the describe command to get more detailed information about a pod.
kubectl describe pods
Logs of a pod
The data that your application would transmit to STDOUT becomes container logs. The following command will provide you access to those logs.
kubectl logs $POD_NAME
Note: You may find out the name of your pod by using the get pods or describe pods commands.
Execute command in Container
The kubectl exec command, which takes the pod name and the command to run as arguments, allows us to run commands directly in our container.
kubectl exec $POD_NAME -- command
Let’s look at an example where we start a bash terminal in the container to see what I mean.
kubectl exec -it $POD_NAME -- bash
Exposing app publicly
A service, as previously said, establishes a policy by which the deployment can be accessed. In this section, we’ll look at how this is achieved and at the other options you have when exposing your services to the public.
Developing a service:
We can build a service with the create service command, which takes the service type and the port we wish to expose as parameters.
kubectl create service nodeport nginx --tcp=80:80
It will generate a service for our Nginx deployment and expose our container’s port 80 on a port of our host computer.
On the host system, use the kubectl get services command to obtain the port:
Image By Author
As you can see, port 80 of the container is routed to port 31041 of my host machine. When you have the port, you may test your deployment by accessing your localhost on that port.
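A port such as 31041 is not arbitrary: by default, Kubernetes allocates NodePort values from the range 30000 to 32767. The sketch below checks a port against that default range and builds the URL to test in a browser. The function names are hypothetical, and a cluster administrator may have configured a different range.

```python
# Default NodePort allocation range (30000-32767 inclusive); a cluster
# can be configured with a different range.
DEFAULT_NODE_PORT_RANGE = range(30000, 32768)


def is_default_node_port(port: int) -> bool:
    """Check whether a port lies in the default NodePort range."""
    return port in DEFAULT_NODE_PORT_RANGE


def local_url(node_port: int, host: str = "localhost") -> str:
    """Build the URL for reaching a NodePort service from the host."""
    return f"http://{host}:{node_port}"


print(is_default_node_port(31041))
print(local_url(31041))
```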
Deleting a service
kubectl delete service nginx
Scale up the app
Scaling your application up and down is a breeze with Kubernetes. By using this command, you may alter the number of replicas, and Kubernetes will create and maintain everything for you.
kubectl scale deployments/nginx --replicas=5
This command will scale our Nginx deployment to five replicas.
This way of application deployment works well for tiny one-container apps but lacks the overview and reusability required for larger applications. YAML files are helpful in this situation.
YAML files allow you to specify your deployment, services, and pods using a markup language, making them more reusable and scalable. The following chapters will go over YAML files in detail.
Kubernetes object in YAML
Every object in Kubernetes can be expressed as a declarative YAML object that specifies what should run and how. These files are frequently used to promote the reusability of resource configurations such as deployments, services, and volumes, among others.
This section will walk you through the fundamentals of YAML and how to get a list of all available parameters and attributes for a Kubernetes object. We glance through the deployment and service files to understand the syntax and how they are deployed.
Parameters of different objects
There are numerous Kubernetes objects, and it is difficult to remember every setting. That’s where the explain command comes in.
You can also acquire documentation for a specific field by using the syntax:
kubectl explain deployment.spec.replicas
Deployment file
For ease of reusability and changeability, more sophisticated deployments are typically written in YAML.
The basic file structure is as follows:
apiVersion: apps/v1
kind: Deployment
metadata:
  # The name and label of your deployment
  name: mongodb-deployment
  labels:
    app: mongo
spec:
  # How many copies of each pod do you want
  replicas: 3
  # Which pods are managed by this deployment
  selector:
    matchLabels:
      app: mongo
  # Regular pod configuration / Defines containers, volumes and environment variable
  template:
    metadata:
      # label the pod
      labels:
        app: mongo
    spec:
      containers:
      - name: mongo
        image: mongo:4.2
        ports:
        - containerPort: 27017
There are several crucial sections in the YAML file:
apiVersion – Specifies the API version.
kind – The Kubernetes object type defined in the file (e.g. deployment, service, persistent volume, …)
metadata – A description of your YAML component that includes the component’s name, labels, and other information.
spec – Specifies the attributes of your deployment, such as replicas and resource constraints.
template – The deployment file’s pod configuration.
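The sections listed above can also be assembled programmatically. The following sketch builds the same kind of Deployment object as a Python dictionary and prints it as JSON, a format kubectl apply accepts as well as YAML. The helper `build_deployment` is hypothetical and only covers the fields discussed here.

```python
import json


def build_deployment(name, app_label, image, replicas, container_port):
    """Assemble a minimal Deployment manifest as a plain dictionary."""
    labels = {"app": app_label}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": dict(labels)},
        "spec": {
            "replicas": replicas,
            # The selector must match the labels set on the pod template.
            "selector": {"matchLabels": dict(labels)},
            "template": {
                "metadata": {"labels": dict(labels)},
                "spec": {
                    "containers": [
                        {
                            "name": app_label,
                            "image": image,
                            "ports": [{"containerPort": container_port}],
                        }
                    ]
                },
            },
        },
    }


manifest = build_deployment("mongodb-deployment", "mongo", "mongo:4.2", 3, 27017)
print(json.dumps(manifest, indent=2))
```

Generating the selector and the template labels from one value avoids a common mistake, where the two drift apart and the deployment no longer matches its own pods.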
Now that you understand the basic format, you can use the apply command to deploy the file.
Service file
Service files are structured similarly to deployments, with slight variations in the parameters.
apiVersion: v1
kind: Service
metadata:
  name: mongo
spec:
  selector:
    app: mongo
  ports:
  - port: 27017
    targetPort: 27017
  type: LoadBalancer
Storage
When a container restarts or a pod is deleted, its entire file system is deleted. That is usually a good thing, since it keeps your stateless application from getting clogged up with unnecessary data. In other circumstances, persisting your file system’s data is critical for your application.
There are several types of storage available:
The container file system stores the data of a single container for as long as that container exists.
Volumes allow you to save data and share it between containers as long as the pod is active.
Persistent volumes keep data even if the pod is deleted or restarted. They’re your Kubernetes cluster’s long-term storage.
Volumes
Volumes allow you to save, exchange, and preserve data among multiple containers for the lifetime of the pod. This is helpful if you have pods with several containers that share data.
In Kubernetes, there are two steps to using a volume:
The pod defines the volume.
The container uses volume mounts to add the volume at a given filesystem path.
You can add a volume to your pod by using the syntax:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
    volumeMounts:
    - name: nginx-storage
      mountPath: /etc/nginx
  volumes:
  - name: nginx-storage
    emptyDir: {}
Here, the volumes tag defines a volume, and volumeMounts mounts it at a particular directory of the container filesystem (in this case, /etc/nginx).
Persistent Volumes
These are nearly identical to conventional volumes, with the one difference that data is preserved even if the pod is deleted. That is why they are employed for long-term data storage needs, such as a database.
A Persistent Volume Claim (PVC) object, which connects to backend storage volumes via a series of abstractions, is the most typical way to define a persistent volume.
Example of YAML Configuration file.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pv-claim
  labels:
    app: sampleAppName
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
There are more options for storing your data in Kubernetes, and you may automate as much of the process as feasible. Here’s a list of a few interesting subjects to look into.
Compute Resources
In container orchestration, managing compute resources for your containers and applications is critical.
When your containers have a set number of resources, the scheduler can make wise decisions about which node to place the pod. You will also have fewer resource contention issues with diverse deployments.
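To see why declared resources help the scheduler, consider this toy placement function. It only picks a node whose free CPU and memory cover the pod's request. It is a deliberately simplified sketch and not the Kubernetes scheduling algorithm. The function `pick_node`, the dictionary layout, and the node names are assumptions for the example.

```python
def pick_node(request, nodes):
    """Pick a node with enough free resources for the request.

    request: {"cpu_m": millicores, "mem_mi": mebibytes}
    nodes:   node name -> free resources in the same units
    Returns the fitting node with the most free CPU, or None.
    """
    candidates = [
        name
        for name, free in nodes.items()
        if free["cpu_m"] >= request["cpu_m"] and free["mem_mi"] >= request["mem_mi"]
    ]
    if not candidates:
        return None  # the pod would stay pending
    return max(candidates, key=lambda name: nodes[name]["cpu_m"])


free = {
    "node-1": {"cpu_m": 200, "mem_mi": 4096},   # hypothetical nodes
    "node-2": {"cpu_m": 1500, "mem_mi": 2048},
}
print(pick_node({"cpu_m": 500, "mem_mi": 1024}, free))
print(pick_node({"cpu_m": 4000, "mem_mi": 1024}, free))
```

Without a declared request, a placement function like this has nothing to compare against, which is how contention between deployments arises.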
In the following two parts, we will go through two types of resource definitions in depth.
Requests
Limits
Secrets
Secrets in Kubernetes allow you to securely store and manage sensitive data such as passwords, API tokens, and SSH keys.
To use a secret in your pod, you must first refer to it. It can happen in many different ways:
As an environment variable or as a file on a volume mounted to a container.
When the kubelet pulls an image from a private registry.
Creating a secret
Secrets are created either with the kubectl command-line tool or by declaring a Secret Kubernetes object in YAML.
Using kubectl
kubectl allows you to create secrets with a create command that requires only the data and the secret name. The data is passed either as a file or as a literal.
kubectl create secret generic admin-credentials --from-literal=user=poweruser --from-literal=password='test123'
Using a file, the same functionality would look like this.
kubectl create secret generic admin-credentials --from-file=./username.txt --from-file=./password.txt
Making use of definition files
Secrets, like other Kubernetes objects, can be declared in a YAML file.
apiVersion: v1
kind: Secret
metadata:
  name: secret-apikey
data:
  apikey: YWRtaW4=
Your sensitive information is stored in the secret as a key-value pair, with apikey as the key and YWRtaW4= as the base64-encoded value.
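The encoded value in the data field can be produced and checked with Python's standard `base64` module. This short sketch shows that YWRtaW4= is simply the base64 form of the string admin. The helper names are made up for the example. Keep in mind that base64 is an encoding, not encryption.

```python
import base64


def encode_secret_value(plain: str) -> str:
    """Encode a plain string the way the data field of a Secret expects."""
    return base64.b64encode(plain.encode("utf-8")).decode("ascii")


def decode_secret_value(encoded: str) -> str:
    """Recover the plain string from a base64-encoded Secret value."""
    return base64.b64decode(encoded).decode("utf-8")


print(encode_secret_value("admin"))     # YWRtaW4=
print(decode_secret_value("YWRtaW4="))  # admin
```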
Using the apply command, you can now generate the secret.
kubectl apply -f secret.yaml
Use the stringData attribute instead if you wish to give plain data and let Kubernetes handle the encoding.
apiVersion: v1
kind: Secret
metadata:
  name: plaintext-secret
stringData:
  password: test
ImagePullSecrets
If you’re pulling an image from a private registry, you may need to authenticate first. When all of your nodes need to pull a specific image, an imagePullSecrets entry holds the authentication info and makes it available to them.
apiVersion: v1
kind: Pod
metadata:
  name: private-image
spec:
  containers:
  - name: privateapp
    image: gabrieltanner/graphqltesting
  imagePullSecrets:
  - name: authentification-secret
Namespaces
Namespaces are virtual clusters used to manage large projects and allocate cluster resources to many users. They provide a scope for names and cannot be nested within one another.
Managing and using namespaces with kubectl is simple. This section will walk you through the most common namespace actions and commands.
Look at the existing Namespaces
You can use the kubectl get namespaces command to see all of your cluster’s presently accessible namespaces.
kubectl get namespaces
# Output
NAME          STATUS   AGE
default       Active   32d
docker        Active   32d
kube-public   Active   32d
kube-system   Active   32d
Creating Namespace
Namespaces can be created with the kubectl CLI or by using YAML to create a Kubernetes object.
kubectl create namespace testnamespace
# Output
namespace/testnamespace created
The same functionality may be achieved with a YAML file.
apiVersion: v1
kind: Namespace
metadata:
  name: testnamespace
The kubectl apply command can then be used to apply the configuration file.
kubectl apply -f testNamespace.yaml
Namespace Filtering
When a new object is created in Kubernetes without a custom namespace property, it is added to the default namespace.
If you want to create your object in a different namespace, you can do it like this.
kubectl create deployment --image=nginx nginx --namespace=testnamespace
You may now use the get command to filter for your deployment.
kubectl get deployment --namespace=testnamespace
Change Namespace
You’ve now learned how to create objects in a namespace other than the default. However, adding the namespace to each command you want to run is time-consuming and error-prone.
Instead, you can use the set-context command to change the default namespace to which commands are applied.
kubectl config set-context $(kubectl config current-context) --namespace=testnamespace
The get-contexts command can be used to validate the modifications.
kubectl config get-contexts
# Output
CURRENT   NAME      CLUSTER   AUTHINFO   NAMESPACE
*         Default   Default   Default    testnamespace
Kubernetes with Docker Compose
For individuals coming from the Docker community, writing Docker Compose files rather than Kubernetes objects may be simpler. Kompose comes into play in this situation. It uses a simple CLI (command-line interface) to convert or deploy your docker-compose file to Kubernetes.
How to Install Kompose
Kompose is quick and easy to install on all three major operating systems.
To install Kompose on Linux or Mac, curl the binaries.
# Linux
# macOS
chmod +x kompose
sudo mv ./kompose /usr/local/bin/kompose
Deploying using Kompose
Kompose deploys Docker Compose files on Kubernetes using existing Docker Compose files. Consider the following compose file as an example.
version: "2"
services:
  redis-master:
    image: chúng tôi
    ports:
      - "6379"
  redis-slave:
    image: gcr.io/google_samples/gb-redisslave:v1
    ports:
      - "6379"
    environment:
      - GET_HOSTS_FROM=dns
  frontend:
    image: gcr.io/google-samples/gb-frontend:v4
    ports:
      - "80:80"
    environment:
      - GET_HOSTS_FROM=dns
    labels:
      kompose.service.type: LoadBalancer
Kompose, like Docker Compose, lets us deploy our setup with a single command.
kompose up
You should now be able to see the resources that were created.
kubectl get deployment,svc,pods,pvc
Converting Kompose
Kompose can also turn your existing Docker Compose file into the Kubernetes object you need.
kompose convert
The apply command is then used to deploy your application.
kubectl apply -f filenames
Application Deployment
Now that you’ve mastered the theory and all of Kubernetes’ core ideas, it’s time to put what you’ve learned into practice. This chapter will show you how to use Kubernetes to deploy a backend application.
This tutorial’s specific application is a GraphQL boilerplate for the Nest.js backend framework.
First, let’s clone the repository.
Images to a Registry
We must first push the images to a publicly accessible Image Registry before starting the construction of Kubernetes objects. It can be a public registry like DockerHub or a private registry of your own.
Visit this post for additional information on creating your own private Docker Image.
To push the image, include the image tag in your Compose file along with the registry you want to push to.
version: '3'
services:
  nodejs:
    build:
      context: ./
      dockerfile: Dockerfile
    image: gabrieltanner.dev/nestgraphql
    restart: always
    environment:
      - DATABASE_HOST=mongo
      - PORT=3000
    ports:
      - '3000:3000'
    depends_on: [mongo]
  mongo:
    image: mongo
    ports:
      - '27017:27017'
    volumes:
      - mongo_data:/data/db
volumes:
  mongo_data: {}
I used a private registry that I had previously set up, but DockerHub would work just as well.
Creating Kubernetes objects
Now that you’ve published your image to a registry, we’ll write our Kubernetes objects.
To begin, create a new directory in which to save the deployments.
mkdir deployments
cd deployments
touch mongo.yaml
touch nestjs.yaml
This is how the MongoDB service and deployment will look.
apiVersion: v1
kind: Service
metadata:
  name: mongo
spec:
  selector:
    app: mongo
  ports:
  - port: 27017
    targetPort: 27017
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mongo
spec:
  selector:
    matchLabels:
      app: mongo
  template:
    metadata:
      labels:
        app: mongo
    spec:
      containers:
      - name: mongo
        image: mongo
        ports:
        - containerPort: 27017
The file includes a deployment object with a single MongoDB container called mongo. It also comes with a service that makes port 27017 available on the Kubernetes network.
Because the container requires some additional settings, such as environment variables and imagePullSecrets, the Nest.js Kubernetes object is a little more complicated.
Its service uses a load balancer to make the port available on the host machine.
Deploy the application
Now that the Kubernetes object files are ready, let us use kubectl to deploy them.
kubectl apply -f mongo.yaml
kubectl apply -f nestjs.yaml
On localhost/graphql, you should now see the GraphQL playground.
Congratulations, you’ve just deployed your first Kubernetes application.
Image-6
You persevered to the end! I hope this guide has given you a better understanding of Kubernetes and the way to use it to improve your developer process, with better production-grade solutions.
Kubernetes was created using Google’s ten years of expertise running containerized apps at scale. It has already been adopted by the top public cloud suppliers and technology providers and is now being adopted by the majority of software manufacturers and companies. It even resulted in the formation of the Cloud Native Computing Foundation (CNCF) in 2015, where Kubernetes was the first project to graduate, and it began streamlining the container ecosystem alongside other container-related projects like CNI, containerd, Envoy, Fluentd, gRPC, Jaeger, Linkerd, and Prometheus. Its clean design, cooperation with industry leaders, open-source licensing, and openness to ideas and contributions may be the main reasons for its popularity and endorsement at such a high level.
Share this with other developers, if you find it useful.
To know more about Kubernetes, Check out the links below
Learn basic tenets from our blog.
References
Image-1 – Photo by Ian Taylor On Unsplash
Image-6 – Photo by Xan Griffin On Unsplash
The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
A Comprehensive Guide On Data Visualization In Python
This article was published as a part of the Data Science Blogathon
Data visualization is the process of finding, interpreting, and comparing data so that complex ideas can be communicated more clearly, making it easier to identify logical patterns during analysis.
Data visualization is important for many analytical tasks, including data summarization, exploratory data analysis, and model output analysis. One of the easiest ways to communicate findings to other people is through a good visualization.
Fortunately, Python has many libraries that provide useful tools for extracting insights from data. The most popular of these are Matplotlib, Seaborn, Bokeh, Altair, etc.
Introduction
The ways we plan and visualize information change quickly and become more complex with each passing day. Due to the proliferation of social media, the availability of mobile devices, and the rollout of digital services, data is available for any human activity that uses technology. The information produced is very valuable: it enables us to analyze trends and patterns and to use big data to draw connections between events. Therefore, data visualization can be an effective way to present the end-user with comprehensible details in real-time.
Image 1
Data visualization can be important for strategic communication: it helps us interpret available data; identify patterns, tendencies, and inconsistencies; make decisions; and analyze existing processes. All told, it can have a profound effect on the business world. Every company has data, whether it comes from contacting customers and senior management or from managing the organization itself. Only through research and interpretation can this data be understood and converted into information. This article seeks to guide students through a series of basic concepts to help them understand data visualization and its components, and equips them with the tools and platforms they need to create interactive views and analyze data. It seeks to provide students with basic terminology and a crash course on the design principles that govern data visualization so that they can create and analyze market research reports.
Table of Contents
What is Data Visualization?
Importance of data visualization
Data Visualization Process
Basic principles for data visualization
Data visualization formats
Data Visualization in Python
Color Schemes for Visualization of Data in Python
Other tools for data visualization
Conclusion
End Notes
Data visualization is the practice of translating data into visual contexts, such as a map or graph, to make data easier for the human brain to understand and to draw insights from. The main goal of data visualization is to make it easier to identify patterns, trends, and outliers in large data sets. The term is often used interchangeably with others, including information graphics, information visualization, and statistical graphics.
Image 2
Data visualization is one of the steps in the data science process: once data has been collected, processed, and modeled, it must be visualized so that conclusions can be drawn. Data visualization is also an element of the broader data presentation architecture (DPA) discipline, which aims to identify, retrieve, manage, format, and deliver data in a highly efficient manner.
Data visualization is important for almost every job. It can be used by teachers to display student test results, by computer scientists developing artificial intelligence (AI), or by managers sharing information with stakeholders. It also plays an important role in big data projects. As businesses accumulated large data collections during the early years of big data, they needed a way to quickly and easily get an overview of all of their data. Visualization tools were a natural fit.
Importance of Data Visualization
We live in a time of visual information, and visual content plays an important role in every moment of our lives. Research conducted by SHIFT Disruptive Learning has shown that we usually process images 60,000 times faster than a table or text and that our brains do a better job of remembering them in the future. The study found that after three days, participants retained between 10% and 20% of written or spoken information, compared to 65% of visual information.
The human brain can perceive imagery in just 13 milliseconds and store information, as long as it is associated with the concept. Our eyes can capture 36,000 visual messages per hour.
40% of nerve fibers are connected to the retina.
All of this shows that people are better at processing visual information, which is embedded in our long-term memory. As a result, in reports and statements, visual representation using images is a more effective way of communicating information than text or tables, and it takes up less space. This means that data visualization is more attractive, easier to interact with, and easier to remember.
Data Visualization Process
Several different fields are involved in the data visualization process, which aims to clarify or reveal existing relationships or to discover something new in a dataset.
1. Filtering and processing.
Filtering and processing data transforms it into information through analysis, interpretation, summarization, comparison, and research.
2. Translation & visual representation.
Creating a visual representation by choosing the image sources, language, context, and introductory wording, all with the recipient in mind.
3. Visualization and interpretation.
Finally, a visualization is effective if it has a cognitive impact on knowledge construction.
Basic principles for data visualization
The purpose of data visualization is to help us understand what the data represents. It is a way of telling stories and presenting research results, as well as a platform for data analysis and exploration. A good understanding of how to create data visualizations will therefore help us to create meaningful and easy-to-remember reports, infographics, and dashboards. Creating the right visualization helps us to solve problems and analyze subject material in detail. The first step in representing information is trying to understand how data is perceived.
1. Overview: This gives viewers an overall understanding of the data as the starting point for their exploration. It means giving them a visual summary of the different types of data while describing their relationships at the same time. This strategy helps us to visualize the data at all of its different levels simultaneously.
2. Zoom in and filter: The second step complements the first so that viewers can understand the underlying structure of the data. Zooming in and out lets us select available data subsets that meet certain criteria while maintaining a sense of position and context.
Data visualization formats
1. Bar Charts
Bar charts are one of the most popular ways to visualize data because they present a data set in a quickly understandable format that allows viewers to see highs and lows at a glance.
They are very versatile and are often used for comparing different categories, analyzing changes over time, or comparing parts of a whole. The three variations on the bar chart are:
Vertical column: Used for chronological data, which should be in left-to-right order.
Horizontal column: Used to visualize categories.
Full stacked column: Used to visualize categories that together add up to 100%.
Source: Netquest- A Comprehensive Guide to Data Visualization (Melisa Matias)
2. Histograms
Histograms represent a variable in the form of bars, where the area of each bar is proportional to the number of values represented. They offer an overview of the distribution of a population or sample with respect to a particular characteristic. The two variations on the histogram are:
Vertical columns
Horizontal columns
Source: Netquest- A Comprehensive Guide to Data Visualization (Melisa Matias)
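To make the idea of binning concrete, here is a minimal sketch in plain Python of how a histogram groups values into equal-width bins and counts them. The sample values, the bin width, and the helper name `histogram_counts` are assumptions made up for illustration; they do not come from the article's datasets.

```python
from collections import Counter

def histogram_counts(values, bin_width, start=0):
    """Count how many values fall into each equal-width bin.

    Returns a dict mapping the left edge of each non-empty bin to its count.
    """
    counts = Counter((v - start) // bin_width for v in values)
    return {start + k * bin_width: counts[k] for k in sorted(counts)}

# Made-up sample of ages, used only for illustration.
ages = [21, 23, 25, 31, 34, 35, 38, 42, 47, 58]
bins = histogram_counts(ages, bin_width=10, start=20)

# Draw a rough text histogram: one '#' per value in the bin.
for left_edge, n in bins.items():
    print(f"{left_edge}-{left_edge + 9}: {'#' * n}")
```

Because every bin has the same width here, the bar length alone reflects the number of values; with unequal bin widths it is the area of the bar that matters.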
3. Pie charts
A pie chart consists of a circle divided into sectors, each representing a portion of the whole. It should be divided into no more than five data groups. Pie charts can be useful for comparing discrete or continuous data.
The two variations on the pie chart are:
Standard: Used to show relationships between components.
Donut: A variation of style that facilitates the inclusion of a whole value or design element in the center.
Source: Netquest- A Comprehensive Guide to Data Visualization (Melisa Matias)
4. Scatter Plot
Scatter plots use points spread over the Cartesian coordinate plane to show the relationship between two variables. They also help us determine whether different groups of data are correlated or not.
Source: Netquest- A Comprehensive Guide to Data Visualization (Melisa Matias)
5. Heat Maps
Source: Netquest- A Comprehensive Guide to Data Visualization (Melisa Matias)
6. Line Plot
Line plots are used to display changes or trends in data over time. They are especially useful for showing relationships, acceleration, deceleration, and volatility in a data set.
Source: Netquest- A Comprehensive Guide to Data Visualization (Melisa Matias)
Color Schemes for Data Visualization in Python
Color is one of the most powerful resources in data visualization, and it is essential if we are to understand the information correctly. Color can be used to separate elements, to balance or represent values, and to interact with the cultural symbols associated with a particular color. It shapes our understanding, so in order to analyze it we must first understand its three properties:
Hue: This is what we usually think of when we picture a color. Hues have no inherent order; they can only be distinguished by their characteristics (blue, red, yellow, etc.).
Brightness: This is a relative measure that describes the amount of light reflected by one object compared with another. Brightness is measured on a scale, and we can talk about brighter and darker values of a single hue.
Saturation: This refers to the intensity of a given color. It varies according to brightness. Darker colors are less saturated, and as a color becomes less saturated, it approaches gray. In other words, it comes close to a neutral (colorless) tone. The following diagram provides a summary of the color application.
Source: Netquest- A Comprehensive Guide to Data Visualization (Melisa Matias)
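The three properties can be explored with Python's standard `colorsys` module, which converts between RGB and the HSV (hue, saturation, value) model. This is only a sketch: HSV "value" is one common stand-in for brightness, and the dark red starting color and the `desaturate` helper are assumptions chosen for illustration.

```python
import colorsys

# A dark red in RGB, with each channel in the range 0..1 (illustrative choice).
r, g, b = 0.6, 0.0, 0.0
hue, saturation, value = colorsys.rgb_to_hsv(r, g, b)

def desaturate(h, s, v, factor):
    """Return the RGB color obtained by multiplying the saturation by factor."""
    return colorsys.hsv_to_rgb(h, s * factor, v)

half = desaturate(hue, saturation, value, 0.5)  # washed-out red
gray = desaturate(hue, saturation, value, 0.0)  # no saturation left

print("hue, saturation, value:", hue, saturation, value)
print("half saturation:", half)
print("zero saturation:", gray)
```

With the saturation reduced to zero, the three RGB channels become equal, which is exactly what a neutral gray is.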
Data Visualization in Python
We'll start with a basic look at the data, then move on to plotting charts, and finally we'll create interactive charts.
We will work with two datasets that match the visualizations shown in the article; the datasets can be downloaded here
They describe the popularity of Internet searches for three terms related to artificial intelligence (data science, machine learning, and deep learning). They were extracted from a popular search engine.
There are two data files. The first one, which we will use in most of the examples, includes data on the popularity of the three terms over time (from 2004 to now, 2023). In addition, I have added a categorical variable (ones and zeros) to show the functionality of charts that vary by category.
The second file contains popularity data broken down by country. We will use it in the final section of the article when working with maps.
Before we move on to the more sophisticated methods, let’s start with the most basic way of visualizing data. We will simply use pandas to look at the details and get an idea of how it is being distributed.
The first thing we have to do is look at a few sample rows to see which columns there are, what information they contain, and how the values are encoded.
With the describe command, we can see how the data is distributed: the count, minimum, maximum, mean, and so on.

df.describe()

With the info command, we can see what type of data each column contains. We may find a column that looks numeric when viewed with the head command but whose values are actually stored as strings; in that case the variable will be encoded as a string.

df.info()

Data Visualization in Python using Matplotlib
Matplotlib is the most basic library for visualizing data graphically. It includes many of the graphs we can think of. Just because it is basic does not mean that it is weak; many of the other visualization libraries we will be talking about are based on it.
Matplotlib charts are made up of two main elements: the axes (the lines that delimit the chart area) and the figure (the canvas on which the axes are drawn). Now let's build the simplest graph:

import matplotlib.pyplot as plt

plt.plot(df['Mes'], df['data science'], label='data science')

We can plot several variables on the same graph and compare them.

plt.plot(df['Mes'], df['data science'], label='data science')
plt.plot(df['Mes'], df['machine learning'], label='machine learning')
plt.plot(df['Mes'], df['deep learning'], label='deep learning')
plt.xlabel('Date')
plt.ylabel('Popularity')
plt.title('Popularity of AI terms by date')
plt.grid(True)
plt.legend()

If you are working with Python from a terminal or a script, call plt.show() after defining the graph with the functions listed above. If you are working in a Jupyter notebook, add %matplotlib inline at the beginning of the file and run it before creating a chart.
We can draw several graphs in one figure. This is useful for comparing charts or for sharing several types of charts in a single image.

fig, axes = plt.subplots(2, 2)
axes[0, 0].hist(df['data science'])
axes[0, 1].scatter(df['Mes'], df['data science'])
axes[1, 0].plot(df['Mes'], df['machine learning'])
axes[1, 1].plot(df['Mes'], df['deep learning'])

We can draw the graph with a different line or marker style for each series:

plt.plot(df['Mes'], df['data science'], 'r-')
plt.plot(df['Mes'], df['data science']*2, 'bs')
plt.plot(df['Mes'], df['data science']*3, 'g^')

Now let's look at a few examples of the different graphics we can make with Matplotlib. We start with the scatterplot:

plt.scatter(df['data science'], df['machine learning'])

Bar chart:

plt.bar(df['Mes'], df['machine learning'], width=20)

Histogram:

plt.hist(df['deep learning'], bins=15)

Data Visualization in Python using Seaborn
Seaborn is a library based on Matplotlib. Basically, what it offers us are nicer-looking graphics and functions to create complex types of graphs with just one line of code.
We import the library and initialize the plot style with sns.set(); without this command the graphics will still have the same style as Matplotlib. We start with one of the simplest graphics, a scatterplot.

import seaborn as sns

sns.set()
# x and y are passed by keyword; recent seaborn versions no longer accept them positionally
sns.scatterplot(x=df['Mes'], y=df['data science'])

We can add information about more than two variables to the same graph. In this case, we use colors and sizes. We also create a separate graph for each value of the categorical column:

sns.relplot(x='Mes', y='deep learning', hue='data science', size='machine learning', col='categorical', data=df)

One of the most popular graphics provided by Seaborn is the heatmap. It is very common to use it to show all the correlations between variables in a dataset:

# numeric_only=True skips non-numeric columns, which newer pandas versions refuse to correlate
sns.heatmap(df.corr(numeric_only=True), annot=True, fmt='.2f')

Another favorite is the pair plot, which shows the relationships between all the variables. Be careful with this function if you have a large dataset, as it has to plot all the data points as many times as there are columns, so processing time grows quickly as the data grows.

sns.pairplot(df)

Now let's make a pair plot with the charts segmented by the values of the categorical variable:

sns.pairplot(df, hue='categorical')

A very informative graph is the joint plot, which allows us to see a scatter plot together with the histograms of the two variables and see how they are distributed:

sns.jointplot(x='data science', y='machine learning', data=df)

Another interesting graph is the violin plot:

sns.catplot(x='categorical', y='data science', kind='violin', data=df)

Data Visualization in Python using Bokeh
Bokeh is a library that allows you to produce interactive graphics. We can export them to an HTML document that we can share with anyone who has a web browser.
It is a very useful library when we want to explore things in a graph and be able to zoom in and move around it, or when we want to share it and let someone else explore the data.
We start by importing the library and defining the file in which to save the graph:

from bokeh.plotting import figure, output_file, save

output_file('data_science_popularity.html')

We draw what we want and save it to the file:

p = figure(title='data science', x_axis_label='Mes', y_axis_label='data science')
# legend_label replaces the older legend keyword, which newer Bokeh releases no longer accept
p.line(df['Mes'], df['data science'], legend_label='popularity', line_width=2)
save(p)

Other Tools for Data Visualization
Some data visualization tools help in visualizing data effectively and faster than the traditional Python coding approach. These are some examples:
Databox
Databox is a data visualization tool used by more than 15,000 businesses and marketing agencies. Databox pulls your data into one place to track real-time performance with attractive displays.
Databox is ideal for marketing teams that want to get set up with dashboards quickly. With 70+ one-click integrations and no need to code, it is a very easy tool to use.
Zoho Analytics
Zoho Analytics is probably one of the most popular BI tools on this list. One thing you can be sure of is that with Zoho analytics, you can upload your data securely. Additionally, you can use a variety of charts, tables, and objects to transform your data concisely.
Tableau
If you want to visualize and analyze data easily, then Tableau is a tool worth considering. It helps you to create charts, maps, and other professional graphics. To improve your visual presentations, you can also get a desktop app.
Additionally, if you are experiencing a problem with the installation of any third-party application, then it provides a “lock server” solution to help visualize online and mobile messaging applications.
You can check out my article on Analytics Vidhya for more information on trending Data Visualization Tools. Top 10 Data Visualization Tools.
Conclusion
With all these different libraries, you may be wondering which library is right for your project. The quick answer is the library that lets you easily create the graphic you want.
In the initial stages of the project, with pandas and pandas profiling we will make a quick visualization to understand the data. If we need to visualize more details we can use simple graphs that we can find in the plots such as scatterplots or histograms.
End Notes
In this article, we discussed data visualization, some basic formats of data visualization, and some practical implementations of Python libraries for data visualization. Finally, we looked at some tools that can perform data visualization effectively.
Thanks For Reading!
About Me:
Hey, I am Sharvari Raut. I love to write!
The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
Comprehensive Guide To Matlab Count
Introduction to Matlab Count
The count command is used in Matlab to find how many times a particular pattern occurs. If the input is a string, this command tells us how many times a specific character or substring occurs in it. We can apply the count command to arrays as well: if the input is an array of strings, count reports the number of occurrences in each element. The count command is used in two ways depending upon the parameter list.
How does Count work in Matlab?
The count command is used in two ways: one is case sensitive and the other is not. This command is mostly used in string operations where we need to find the occurrences of characters. In the input string, a character may be present multiple times, in both upper case and lower case. The second way is therefore used to ignore the case of letters.
Step 1: Accept the input string.
Step 2: Declare variable to store count and apply the command count.
Step 3: Display the result.
Syntax:
Variable name = count(input, 'event', 'IgnoreCase', true)
Examples of Matlab Count
The examples are given below:
Example #1
Let us consider the input string “blueberry is blue” and find the occurrences of “blue” in it. Table 1 illustrates the Matlab code for example 1. The output is 2 because blue occurs two times: once as a standalone word and once as part of another word (blueberry).
Code:
input = 'blueberry is blue';
c = count(input, 'blue')
Output:
Example #2
Let us consider the input string “Blueberry is blue” and find the occurrences of “blue” in it. Here the output is 1: blue occurs two times, once as a standalone word and once as part of another word (Blueberry), but the second occurrence starts with an uppercase letter, so the case-sensitive form does not match it. If we use the second form of the count command, the output will be 2. Table 2(a) illustrates the Matlab code for example 2 using the first approach, and Table 2(b) illustrates the Matlab code for example 2 using the second approach.
Code:
input = 'Blueberry is blue';
c = count(input, 'blue')
Output:
Code:
c = count(input, 'blue', 'IgnoreCase', true)
Output:
Example #3
Let us consider the input as a string array with two rows, “one, five, eleven, five, four, ten, one, four, three”; “one, zero, ten, zero, ten, one, two, ten, one, eight”, and find the occurrences of “one” in it. Here the output is different for each row. The elements of the two rows are separated by “;”. Table 3 illustrates the Matlab code for example 3.
Code:
input = ["one, five, eleven, five, four, ten, one, four, three"; "one, zero, ten, zero, ten, one, two, ten, one, eight"];
c = count(input, "one")
Output:
Example #4
Let us consider the input as a string array with two rows, “one, five, eleven, five, four, ten, one, four, three”; “one, zero, ten, zero, ten, one, two, ten, one, eight”, and find the occurrences of two events simultaneously, “one” and “two”. Here the output is different for each row. The elements of the two rows are separated by “;”. Table 4 illustrates the Matlab code for example 4. The output is 2 and 4 because in the first string ‘one’ occurs two times and ‘two’ occurs zero times, and in the second string ‘one’ occurs three times and ‘two’ occurs one time.
Code:
input = ["one, five, eleven, five, four, ten, one, four, three"; "one, zero, ten, zero, ten, one, two, ten, one, eight"];
c = count(input, ["one", "two"])
Output:
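For readers who want to cross-check the four examples outside Matlab, the same counting behaviour can be sketched in Python. This is an illustrative analogue, not Matlab code: `count_occurrences` is a hypothetical helper built on Python's `str.count`, which counts non-overlapping matches, and the `ignore_case` flag stands in for the 'IgnoreCase' option.

```python
def count_occurrences(text, patterns, ignore_case=False):
    """Count non-overlapping occurrences of one or more patterns in text."""
    if isinstance(patterns, str):
        patterns = [patterns]
    if ignore_case:
        # Compare everything in lower case to mimic a case-insensitive search.
        text = text.lower()
        patterns = [p.lower() for p in patterns]
    # With several patterns, the result is the sum over all of them.
    return sum(text.count(p) for p in patterns)

example_1 = count_occurrences("blueberry is blue", "blue")
example_2a = count_occurrences("Blueberry is blue", "blue")
example_2b = count_occurrences("Blueberry is blue", "blue", ignore_case=True)

rows = [
    "one, five, eleven, five, four, ten, one, four, three",
    "one, zero, ten, zero, ten, one, two, ten, one, eight",
]
example_3 = [count_occurrences(row, "one") for row in rows]
example_4 = [count_occurrences(row, ["one", "two"]) for row in rows]

print(example_1, example_2a, example_2b, example_3, example_4)
```

The results agree with the counts described in the text for examples 1, 2 and 4, and give one count per row for example 3.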
Conclusion
As we have seen in the above examples, the count command can be used in multiple ways. We can apply this command to one-dimensional as well as multidimensional arrays. The count command is mostly used in string operations to count the occurrences of characters or substrings.
A Simple Guide To Perform A Comprehensive Content Audit
No matter why you have a website, it needs to be filled with great content.
Without good content, you might as well not have a website at all.
But how exactly do you know when you have good content?
You might read through a piece of content and think it’s perfectly fine, but there’s a more reliable way of figuring it out.
If you’re wondering if your content is performing well, there’s a good chance it’s time for a content audit to check for sure.
By following the right steps, knowing what to look for, and what you’re hoping to get out of your content audit, you can look forward to creating a better website.
What Is a Content Audit?
At some point, every website will need a content audit.
A content audit gives you the opportunity to review closely all of the content on your website and evaluate how it’s working for you and your current goals.
This helps show you:
What content is good.
What needs to be improved.
What should just be tossed away.
What your content goals for the future should look like.
There are also some types of websites that are more in need of content audits than others.
If you have a relatively new website where all of your content is still fresh, you won’t really be in need of a content audit for a while.
Older sites have a lot more to gain from having a content audit done, as well as websites that have a large amount of content.
This makes websites like a news site a great contender for audits. The size of a website will also affect how often a content audit is necessary.
What Is the Purpose of Content Audits?
Content is known for being a great digital marketing investment because it will continue to work for you long into the future, but that doesn’t mean that it doesn’t require some upkeep from time to time.
What worked for your website at one point might not anymore, so it only makes sense to go back and review it.
Improve Organic Ranking
If you aren’t ranking highly, it could be a problem with your content.
Some of the content you have might not be SEO-friendly, and although it might be valuable content to have, there’s no way for it to rank highly.
If the content you have is already good, optimizing it to be more SEO-friendly can be a simple change that makes a big difference in your rankings.
Revitalize Older Content
Even the best content gets old at some point.
After a while, you might end up missing out on important keywords, having content with broken links, outdated information, among other issues.
If older content isn’t performing well, that doesn’t mean it can’t serve a new purpose for your website.
Giving new life to some of your older content can give you the same effect as having something totally brand new, without requiring you to put in the amount of work that an entirely new piece of content would need.
Get Rid of Irrelevant Content
Some content that’s great at the moment only benefits you for a short while.
While you might find older content on your website that can be updated to be more useful, sometimes it has just become irrelevant.
When this is the case, you don’t have to keep it if it’s only taking up space.
Eliminate Similar & Duplicate Content
In addition to unimportant pages, you can also find duplicate content to get rid of during a content audit.
Duplicate content can often occur by accident and wasn’t created to try and cheat the system, but regardless of why you have it, you can be penalized by search engines for it.
If you do find that you have extremely similar or duplicate content, but you can’t get rid of it, you can fix the problem by canonicalizing your preferred URL.
Plan for the Future
When you go through the content you currently have, you might end up seeing some gaps that need to be filled.
When you realize you’re missing out on important information and topics your audience needs, this is the time to make up for that.
You’ll be able to realize what’s lacking in your website to create more useful content in the future.
How to Perform a Content Audit
A content audit at first glance might seem like simply reading through your website’s content, but there’s much more to it than that.
For an effective content audit, you’ll need to rely heavily on online tools to get the data you need.
So, before you get started with a content audit, it’s important to know exactly what you’ll need to be doing beforehand.
1. Know Your Reason
If you’re going through the effort of performing a content audit, you’re not doing it for nothing.
There must be some goal that’s driving you to do this.
Not everyone will have the same reason for having a content audit, although many of the reasons might seem similar, so what you’ll want to look for might vary.
2. Use Screaming Frog to Index Your URLs
One tool that you should always use during an audit is Screaming Frog.
This tool will allow you to create an inventory of the content you have on your website by gathering URLs from your sitemap.
If you have fewer than 500 pages to audit, you can even get away with using the free version.
This is one of the easiest ways of getting all of your content together to begin your content audit.
3. Incorporate Google Analytics Data
After you’ve made an inventory of your website’s content, you’ll need to see how it’s performing.
For this, Google Analytics can give you all the information you need.
This can give you valuable insights as to how people feel about your content, such as how long they stick around for it and how many pages they’re viewing per session.
4. Examine Your Findings
The data you get from Google Analytics will make it easier for you to figure out what your next move will be.
After reviewing your findings, it might be clear what’s holding your content down.
The solution may not be obvious, but by looking closely at what your data tells you and researching, you can figure it out with a little bit of effort.
For example, if you have one great, high-quality piece of content that doesn’t get many views, it might just need to be updated slightly and reshared.
5. Make a Plan
Finally, you should figure out what the necessary changes will be and how you’ll go about making them.
If you have a long list of changes that need to be implemented, consider which ones are a priority and which ones can be fixed over time.
Planning for the future might include not just the changes to be made on existing content, but the arrangements for creating new content in the future.
Final Thoughts
Content audits might seem intimidating, but they are key to making sure all of the content on your website is working for you and not against you.
Performing a content audit doesn’t mean that you’ve been making huge mistakes with your content.
Such an audit is simply maintenance that even websites with the best content need to do.
Getting into this can seem overwhelming, but with the right help, an audit will leave you feeling more confident in your content and will help guide your next steps.
A Comprehensive Guide To Time Series Analysis And Forecasting
This article was published as a part of the Data Science Blogathon.
Time series analysis and forecasting is a prominent and powerful area of study in data science, data analytics, and artificial intelligence. It helps us work with data that changes over time. For example, suppose you have visited a clinic because of chest pain and want an electrocardiogram (ECG) test done to see whether your heart is functioning healthily. The ECG graph produced is time-series data in which your Heart Rate Variability (HRV) is plotted with respect to time; by analysing it, the doctor can suggest crucial measures to take care of your heart and reduce the risk of stroke or heart attack. Time series are used widely in healthcare analytics, geospatial analysis, weather forecasting, and to forecast the future of data that changes continuously with time!
What is Time Series Analysis in Machine Learning?
Time-series analysis is the process of extracting useful information from time-series data to forecast and gain insights from it. It consists of a series of data that varies with time, hence continuous and non-static in nature. It may vary from hours to minutes and even seconds (milliseconds to microseconds). Due to its non-static and continuous nature, working with time-series data is indeed difficult even today!
As time-series data consists of a series of observations taken in sequences of time, it is entirely non-static in nature.
Time Series – Analysis Vs. Forecasting
Time series data analysis is the scientific extraction of useful information from time-series data to gather insights from it. It consists of a series of data that varies with time. It is non-static in nature. Likewise, it may vary from hours to minutes and even seconds (milliseconds to microseconds). Due to its continuous and non-static nature, working with time-series data is challenging!
Time Series Analysis and Time Series Forecasting are two studies whose names are, most of the time, used interchangeably, although there is a thin line between the two. Which name applies depends on whether we are analysing and summarizing existing time-series data or predicting future trends from it.
Thus, it’s a descriptive vs. predictive strategy based on your time-series problem statement.
In a nutshell, time series analysis is the study of patterns and trends in a time-series data frame using descriptive and inferential statistical methods, whereas time series forecasting involves extrapolating future trends or values from old data points (supervised time-series forecasting), or clustering them into groups and predicting future patterns (unsupervised time-series forecasting).
The Time Series Integrants
Any time-series problem or data can be broken down or decomposed into several integrants (components), which can be useful for performing analysis and forecasting. Transforming a time series into a series of integrants is called Time Series Decomposition.
A quick thing worth mentioning is that the integrants are broken further into 2 types-
1. Systematic — components that can be used for predictive modelling and occur recurrently. Level, Trend, and Seasonality come under this category.
2. Non-systematic — components that cannot be used for predictive modelling directly. Noise comes under this category.
The original time series data is hence split or decomposed into 5 parts-
1. Level: The most common integrant in every time series is the level. It is nothing but the mean or average value of the time series. It has zero variance when plotted against itself.
2. Trend: The linear movement or drift of the time series, which may be increasing, decreasing, or neutral. Trends are observable as positive (increasing), negative (decreasing), or even flat slopes over the entire range of time.
3. Seasonality: Seasonality is something that repeats over a lapse of time, say a year. An easy way to get an idea of seasonality is to think of seasons like summer, winter, spring, and monsoon, which come and go in cycles throughout a specified period of time. In terms of data science, seasonality is the integrant that repeats at a similar frequency.
Note: If seasonality doesn’t occur at the same frequency, we call it a cycle. A cycle does not have any predefined, fixed frequency, and its occurrence is very uncertain in terms of probability. It may sometimes be random, which poses a great challenge in forecasting.
4. Noise: An irregularity or noise is a randomly occurring integrant. It is optional and comes under observation if and only if the features are not correlated with each other and, most importantly, the variance is similar across the series. Noise can lead to dirty and messy data and hinder forecasting, hence noise removal, or at least reduction, is a very important part of the time series data pre-processing stage.
5. Cyclicity: A particular time-series pattern that repeats itself after a large gap or interval of time, like months, years, or even decades.
The Time Series Forecasting Applications
Time series analysis and forecasting are used to automate a variety of tasks, such as-
Weather Forecasting
Anomaly Forecasting
Sales Forecasting
Stock Market Analysis
ECG Analysis
Risk Analysis
and many more!
Time Series Components Combinatorics
A time-series model can be represented by 2 methodologies-
The Additive Methodology —
When the integrants of the time series combine linearly, i.e., the frequency (width) and amplitude (height) of the seasonal pattern stay the same over time, the additive rule is applied.
Additive methodology is used when we have a time series where seasonal variation is linear or constant over timestamps.
It can be represented as follows-
y(t) or x(t) = level + trend + seasonality + noise
where the model y(multivariate) or x(univariate) is a function of time t.
The Multiplicative Methodology —
When the time series is not a linear relationship between integrants, then modelling is done following the multiplicative rule.
The multiplicative methodology is used when we have a time series where seasonal variation increases with time — which may be exponential or quadratic.
It is represented as-
y(t) or x(t)= Level * Trend * Seasonality * Noise
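The difference between the two rules can be sketched with synthetic data in plain Python. Everything numeric below (the level of 100, the slopes, the 12-step period, the sine-shaped seasonality) is an assumption invented for illustration, and the noise term is left out so that the effect is easy to see.

```python
import math

period, n_periods = 12, 4              # four "years" of monthly points (assumed)
steps = range(period * n_periods)

level = 100.0
trend = [0.5 * i for i in steps]                                   # linear drift
season = [10 * math.sin(2 * math.pi * i / period) for i in steps]  # repeats every 12 steps

# Additive rule: the components are summed, so the seasonal swing keeps a constant height.
additive = [level + trend[i] + season[i] for i in steps]

# Multiplicative rule: the components are multiplied, so the swing grows with the trend.
growth = [1 + 0.02 * i for i in steps]
season_factor = [1 + 0.1 * math.sin(2 * math.pi * i / period) for i in steps]
multiplicative = [level * growth[i] * season_factor[i] for i in steps]

def swing(series, year):
    """Peak-to-trough range of the series within one period."""
    chunk = series[year * period:(year + 1) * period]
    return max(chunk) - min(chunk)

for name, series in (("additive", additive), ("multiplicative", multiplicative)):
    print(name, round(swing(series, 0), 2), round(swing(series, n_periods - 1), 2))
```

In the additive series the swing is the same in the first and last period, while in the multiplicative series it widens as the level rises, which is the usual visual cue for choosing between the two.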
Deep-Dive into Supervised Time-Series Forecasting
Supervised learning is the most widely used branch of machine learning, and hence we will focus on supervised time series forecasting.
This will contain various detailed topics to ensure that readers at the end will know how to-
Load time series data and use descriptive statistics to explore it
Scale and normalize time series data for further modelling
Extracting useful features from time-series data (Feature Engineering)
Checking the stationarity of the time series to reduce it
ARIMA and Grid-search ARIMA models for time-series forecasting
Heading to deep learning methods for more complex time-series forecasting (LSTM and bi-LSTMs)
So without further ado, let’s begin!
Load Time Series Data and Use Descriptive Statistics to Explore it
For easy and quick understanding and analysis of time-series data, we will work on the famous toy dataset named ‘Daily Female Births Dataset’.
Get the dataset downloaded from here.
Importing necessary libraries and loading the data –
import numpy as np
import pandas as pd
import statsmodels
import matplotlib.pyplot as plt
import seaborn as sns

# index_col=0 makes the date column the index; parse_dates=True parses it as datetimes
data = pd.read_csv('daily-total-female-births-in-cal.csv', parse_dates=True, header=0, index_col=0)
# squeeze=True is no longer accepted by read_csv in pandas 2.0+, so squeeze the one-column frame instead
data = data.squeeze('columns')
data.head()

This is the output we get-

1959-01-01    35
1959-01-02    32
1959-01-03    30
1959-01-04    31
1959-01-05    44
Name: Daily total female births in California, 1959, dtype: int64

Note: Remember, it is required to use parse_dates because it converts the dates to datetime objects, header=0 ensures the column names are stored for easy reference, index_col=0 makes the date column the index, and squeeze('columns') converts the single-column data frame into a Series.
Exploring the Time-Series Data:
    print(data.size)  # output: 365
(a) Carry out some descriptive statistics:
    print(data.describe())
Output:
    count    365.000000
    mean      41.980822
    std        7.348257
    min       23.000000
    25%       37.000000
    50%       42.000000
    75%       46.000000
    max       73.000000
(b) A look at the time-series plot:
    plt.plot(data)
    plt.show()
Scale and Normalize Time Series Data for Further Modelling
Normalization rescales the numeric features in the training data to the range 0 to 1, so that gradient descent and loss optimization are fast and efficient and converge quickly to a minimum. It is one form of feature scaling, which matters for most ML problems.
Let’s see how we can achieve normalization in time-series data.
For this purpose, let’s pick a highly fluctuating time-series data — the minimum daily temperatures data. Grab it here!
[Figure: line plot of the minimum daily temperatures data, showing its strongly fluctuating nature. Source: DataMarket]
To normalize a feature, Scikit-learn’s MinMaxScaler is very handy! If you want to recover the original data points after prediction, the scaler also provides an inverse_transform() method.
Here goes the normalization code —
    # import necessary libraries
    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler

    # load and sanity check the data
    data = pd.read_csv('daily-minimum-temperatures-in-me.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
    print(data.head())

    # convert the series into a column vector of shape (n_samples, 1)
    values = data.values.reshape((len(data), 1))

    # feature scaling: fit the scaler on the data to learn the min and max values
    scaler = MinMaxScaler(feature_range=(0, 1))
    scaler = scaler.fit(values)
    print('Min: %f, Max: %f' % (scaler.data_min_[0], scaler.data_max_[0]))

    # normalize the data and sanity check
    normalized = scaler.transform(values)
    for i in range(5):
        print(normalized[i])

    # inverse transform to obtain the original values
    original_matrix = scaler.inverse_transform(normalized)
    for i in range(5):
        print(original_matrix[i])
Let’s have a look at what we got:
See how the values have scaled!
Note: In our case, the data has no outliers, so MinMaxScaler serves the purpose well. If you are taking an unsupervised learning approach and your data contains outliers, standardization is usually the safer choice. Min-max normalization is defined by the minimum and maximum values, so a single extreme value squeezes all the other points into a narrow band. Standardization rescales the data to a mean of 0 and a standard deviation of 1 without forcing it into a fixed range, which makes it less sensitive to outliers, although the mean and standard deviation are still affected by them.
More on that here!
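The effect of a single outlier on both kinds of scaling can be checked with a few made-up numbers. This is a plain-Python sketch of the two formulas, not scikit-learn's implementation; `min_max` and `z_score` are hypothetical helper names.

```python
import statistics

def min_max(values):
    """Rescale to [0, 1] using the smallest and largest value."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Rescale to mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Made-up temperature readings, once clean and once with a single extreme value.
temps = [10.0, 11.0, 12.0, 13.0, 14.0]
temps_with_outlier = temps + [100.0]

scaled_clean = min_max(temps)
scaled_outlier = min_max(temps_with_outlier)
standardized = z_score(temps_with_outlier)

print("min-max, clean:       ", scaled_clean)
print("min-max, with outlier:", [round(v, 3) for v in scaled_outlier])
print("z-score, with outlier:", [round(v, 3) for v in standardized])
```

With the outlier present, min-max scaling squeezes the five ordinary readings into less than 5% of the range. The standardized values keep a mean of 0 and a standard deviation of 1, but they are pulled by the outlier too, so neither method replaces dealing with the outlier itself.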
Extracting Useful Features from Time-Series Data (Feature Engineering)
Framing data as a supervised learning problem means extracting useful features and discarding irrelevant ones, to make the model robust and cost-efficient.
We already know that supervised learning problems have 2 types of features: the independent features (x) and the dependent feature, or target (y). How well the target is predicted depends on how well we choose and engineer the independent features.
A univariate time series has two columns: the timestamp and its respective value. So in a time series problem, the independent feature is time and the dependent feature is the value.
Now let us look at the features that can be engineered from these inputs and outputs so that the inherent relationship between the two variables is captured and the forecast is as good as possible.
The features that are most important for modelling the relationship between the input and output variables in a time series are:
1. Descriptive Statistical Features: As straightforward as it sounds, calculating the statistical summary of any data is extremely important: mean, median, standard deviation, quantiles, and min-max values. These come in handy in tasks such as outlier detection, scaling and normalization, and recognizing the distribution.
2. Window Statistic Features: Window features are statistical summaries computed over a fixed-size window of previous timestamps. There are, in general, 2 ways to extract descriptive statistics from windows:
(a) Rolling Window Statistics: The rolling window calculates rolling means, conventionally called the moving average, and often other statistics. It computes summary statistics (mostly the mean) across the values within a sliding window, which we can then add as features to our dataset.
Let the value at timestamp t-1 be x and the value at t-2 be y; we take the average of x and y to predict the value at the next timestamp. The rolling window thus takes the mean of 2 values to predict the 3rd. The window then shifts forward by one step, so a mean is calculated for every window of 2 values. Rolling window statistics are most useful when recent data matters more for forecasting than older data.
Let’s see how we can calculate moving or rolling average with a rolling window —
    from pandas import DataFrame, concat

    df = DataFrame(data.values)
    tshifts = df.shift(1)              # shift by one step so each row only sees earlier values
    rwin = tshifts.rolling(window=2)   # window over the two previous values
    moving_avg = rwin.mean()
    joined_df = concat([moving_avg, df], axis=1)
    joined_df.columns = ['mean(t-2,t-1)', 't+1']
    print(joined_df.head(5))
Let’s have a look at what we got:
(b) Expanding Window Statistics: Similar to the rolling window, except that the window is not fixed in size: each time it advances, it grows to include all the previous observations. This is beneficial when older data is as important for forecasting as recent data.
Let’s have a quick look at expanding window code-
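    # expanding window over the shifted values: row t summarizes everything up to t-1
    window = tshifts.expanding()
    # the original averaged the rolling window (rwin) here; the expanding window is what is intended
    joined_df2 = concat([window.mean(), df], axis=1)
    joined_df2.columns = ['mean', 't+1']
    print(joined_df2.head(5))
Let’s have a look at what we got: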
3. Lag Features: A lag feature is the value at an earlier timestamp, say t-1, used as an input for predicting the value at timestamp t. The lag is simply the distance between the two timestamps.
4. Datetime Features: This is simply the conversion of the timestamp into its specific components, like the month or day, stored alongside the value (the temperature, for example) for better forecasting. By doing this, we can gather specific information about the month and day at a particular timestamp for each record.
5. Timestamp Decomposition: Timestamp decomposition means breaking the timestamp down into subset columns that mark unique and special timestamps. Before Diwali or, say, Christmas, the sale of crackers, Santa caps, and fruit cakes rises sharply compared with other times of the year. So storing such special timestamps by decomposing the original timestamp is useful for forecasting.
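Features 3 to 5 can be sketched with the standard library alone. The dates, values, and the holiday calendar below are made up for illustration, and the column names are hypothetical.

```python
from datetime import date, timedelta

# Hypothetical daily sales values; the dates and numbers are made up.
start = date(2023, 12, 22)
values = [20, 22, 35, 50, 18, 17]
special_days = {date(2023, 12, 25)}  # assumed holiday calendar

rows = []
for i, value in enumerate(values):
    day = start + timedelta(days=i)
    rows.append({
        "date": day,
        "lag_1": values[i - 1] if i >= 1 else None,  # lag feature: value at t-1
        "lag_2": values[i - 2] if i >= 2 else None,  # lag feature: value at t-2
        "month": day.month,                          # datetime features
        "day": day.day,
        "weekday": day.weekday(),                    # Monday = 0
        "is_special": day in special_days,           # timestamp decomposition flag
        "target": value,                             # value at t
    })

for row in rows:
    print(row)
```

The first rows have no lag values because nothing precedes them; such rows are usually dropped before training. With pandas, `Series.shift()` and the `.dt` accessor produce the same kind of columns.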
Time-series Data Stationary Checks
So, let’s first digest what stationary time-series data is!
A stationary series, as the term suggests, is consistent: its statistical properties, such as mean and variance, do not change over time. Time-series data that contains no trend or seasonality is termed stationary. Any time series that has a trend or seasonality is, thus, non-stationary.
Recall that, of the two time series we worked on, the childbirths data had no trend or seasonality and is stationary, whereas the minimum daily temperatures data has a seasonality factor and drifts, and hence is non-stationary and harder to model!
Stationarity in time series comes in 3 types:
(a) Trend Stationary: the series has no trend, or becomes stationary once a deterministic trend is removed.
(b) Seasonality Stationary: the series has no seasonality factor.
(c) Strictly Stationary: the joint distribution of the observations does not change when the series is shifted in time.
Now that we know what stationarity in time series is, how can we check for the same?
Vision is everything. A quick visualization of the time series at hand gives a first impression of whether the data is stationary. Next in line come the summary statistics. A close look at statistics such as min, max, variance, standard deviation, mean, and quantiles can help reveal drifts or shifts in the data.
Let’s do a quick proof of concept!
For the stationary case, we take the handy childbirths data we worked on earlier. For the non-stationary case, let’s take the famous airline-passenger data, which is simply the number of airline passengers per month, and check how the two differ.
Case 1 — Stationary Proof
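    import pandas as pd
    import matplotlib.pyplot as plt

    # index_col=0 and squeeze("columns") give a Series indexed by date
    data = pd.read_csv('daily-total-female-births.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
    data.hist()
    plt.show()
Output:
[Figure: histogram of the daily female births data]
As I said, vision! The histogram looks roughly Gaussian, which is consistent with a stationary series, although a bell-shaped histogram alone does not prove it.
More curious? Let’s back this up with numbers!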
    X = data.values
    seq = round(len(X) / 2)
    x1, x2 = X[0:seq], X[seq:]   # first and second half of the series
    meanx1, meanx2 = x1.mean(), x2.mean()
    varx1, varx2 = x1.var(), x2.var()
    print('meanx1=%f, meanx2=%f' % (meanx1, meanx2))
    print('variancex1=%f, variancex2=%f' % (varx1, varx2))
Output:
    meanx1=39.763736, meanx2=44.185792
    variancex1=49.213410, variancex2=48.708651
The means and variances of the two halves are close to each other, which suggests that the data is stationary. Great.
Case 2— Non-Stationary Proof
    import pandas as pd
    import matplotlib.pyplot as plt

    # index_col=0 and squeeze("columns") give a Series indexed by month
    data = pd.read_csv('international-airline-passengers.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
    data.hist()
    plt.show()
Output:
[Figure: histogram of the monthly airline passenger counts]
The histogram is too skewed to pass for a Gaussian distribution. Let’s now quickly get the mean and variance gaps.
    X = data.values
    seq = round(len(X) / 2)
    x1, x2 = X[0:seq], X[seq:]   # first and second half of the series
    meanx1, meanx2 = x1.mean(), x2.mean()
    varx1, varx2 = x1.var(), x2.var()
    print('meanx1=%f, meanx2=%f' % (meanx1, meanx2))
    print('variancex1=%f, variancex2=%f' % (varx1, varx2))
Output:
    meanx1=182.902778, meanx2=377.694444
    variancex1=2244.087770, variancex2=7367.962191
The large gaps between the two means and between the two variances point clearly to a non-stationary series.
ARMA, ARIMA, and SARIMAX Models for Time-Series Forecasting
A traditional yet remarkably effective way of forecasting a time series is with the statistical models ARMA (Auto-Regressive Moving Average) and ARIMA (Auto-Regressive Integrated Moving Average).
Beyond these 2 traditional approaches, we have SARIMAX (Seasonal Auto-Regressive Integrated Moving Average with eXogenous factors), which we will see too!
So, let’s explore the models, one by one!
ARMA
The ARMA model combines 2 statistical models: the Auto-Regressive (AR) model and the Moving Average (MA) model.
The Auto-Regressive model estimates the value y(t) at a given timestamp t on the basis of its lags. For a single lag, the formula is:
y(t) = α + β * y(t-1) + ε(t)
Here, y(t) = predicted value at timestamp t, α = intercept term, β = coefficient of lag, y(t-1) = time-series lag at timestamp t-1, and ε(t) = the error (noise) term.
So α and β are the model parameters that are estimated from the data in order to predict y(t).
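As a rough illustration of how α and β can be estimated, the sketch below simulates a single-lag series with assumed "true" parameters and recovers them by ordinary least squares, regressing y(t) on y(t-1). It uses only the standard library, and all numbers are made up; statistical packages estimate the parameters with more refined methods.

```python
import random

random.seed(0)

# Simulate y(t) = alpha + beta * y(t-1) + noise with assumed "true" parameters.
true_alpha, true_beta = 2.0, 0.7
y = [true_alpha / (1 - true_beta)]  # start at the long-run mean of the process
for _ in range(2000):
    y.append(true_alpha + true_beta * y[-1] + random.gauss(0, 1))

# Regress y(t) on y(t-1): ordinary least squares with a single predictor.
x_lag, target = y[:-1], y[1:]
n = len(target)
mean_x = sum(x_lag) / n
mean_y = sum(target) / n
cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x_lag, target))
var_x = sum((a - mean_x) ** 2 for a in x_lag)

beta_hat = cov_xy / var_x
alpha_hat = mean_y - beta_hat * mean_x

# One-step-ahead forecast from the last observed value.
forecast = alpha_hat + beta_hat * y[-1]
print(f"alpha={alpha_hat:.3f}, beta={beta_hat:.3f}, next value={forecast:.3f}")
```

The estimates land close to the assumed values of 2.0 and 0.7, and the forecast is simply the fitted line evaluated at the most recent observation.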
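The Moving Average model plays a similar role, but it does not use past values of the series, and despite its name it is not the rolling average seen earlier. Instead, it uses the lagged forecast errors of previous predictions to predict future values. For a single lag, the formula is:
y(t) = μ + ε(t) + θ * ε(t-1)
Here, μ = mean of the series, ε(t) and ε(t-1) = forecast errors at timestamps t and t-1, and θ = coefficient of the lagged error.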
Let’s see how both the AR and MA models perform on the International-Airline-Passengers data.
AR model
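    # fit(disp=-1) belongs to the legacy ARIMA class (statsmodels.tsa.arima_model),
    # which was removed in statsmodels 0.13
    from statsmodels.tsa.arima_model import ARIMA

    # indexedDataset_logScale and datasetLogDiffShifting are not defined in this article;
    # the names suggest the log-transformed passenger series and its first difference
    AR_model = ARIMA(indexedDataset_logScale, order=(2, 1, 0))
    AR_results = AR_model.fit(disp=-1)
    plt.plot(datasetLogDiffShifting)
    plt.plot(AR_results.fittedvalues, color='red')
    plt.title('RSS: %.4f' % sum((AR_results.fittedvalues - datasetLogDiffShifting['#Passengers'])**2))
The RSS (residual sum of squares) is 1.5023 for the AR model, which is not very satisfactory. Note that order=(2,1,0) already includes one round of differencing (d=1), so strictly speaking this is an ARIMA model with AR terms only.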
MA Model
    MA_model = ARIMA(indexedDataset_logScale, order=(0, 1, 2))
    MA_results = MA_model.fit(disp=-1)
    plt.plot(datasetLogDiffShifting)
    plt.plot(MA_results.fittedvalues, color='red')
    plt.title('RSS: %.4f' % sum((MA_results.fittedvalues - datasetLogDiffShifting['#Passengers'])**2))
The MA model shows similar results to AR, differing by a very small amount. We know our data is non-stationary, so let’s try to improve this RSS score by combining the AR, I, and MA parts in a full ARIMA model!
ARIMA
Besides combining the AR and MA models used earlier, ARIMA adds the concept of Integration (I): differencing the observations in order to make non-stationary data stationary, for better forecasting. This makes it more widely applicable than its predecessor ARMA, which assumes stationary data.
Differencing takes the difference between the values at two consecutive timestamps (t-1 and t, for example). Doing this helps achieve a roughly constant mean rather than a strongly drifting, non-stationary one.
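Here is a minimal sketch of first-order differencing on a made-up trending series, using the same split-half comparison of means as in the stationarity check earlier. The helper name `half_means` is hypothetical.

```python
# Hypothetical series with a steady upward trend plus a small repeating wiggle.
series = [10 + 3 * t + (1 if t % 2 == 0 else -1) for t in range(20)]

# First-order differencing: d(t) = y(t) - y(t-1). One value is lost at the start.
diff = [series[t] - series[t - 1] for t in range(1, len(series))]

def half_means(values):
    """Mean of the first and of the second half of a sequence."""
    mid = len(values) // 2
    first, second = values[:mid], values[mid:]
    return sum(first) / len(first), sum(second) / len(second)

print("raw halves:        ", half_means(series))
print("differenced halves:", half_means(diff))

# "Integration" is the reverse step: a running sum of the differences,
# started from the first observation, restores the original series.
restored = [series[0]]
for d in diff:
    restored.append(restored[-1] + d)
```

The means of the two halves are far apart for the raw series and nearly equal after differencing. The running sum at the end shows why the I in ARIMA stands for integration: forecasts made on the differenced scale are summed back up to the original scale. With pandas, `Series.diff()` performs the same differencing.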
Let’s fit the same data with ARIMA and see how well it performs!
    ARIMA_model = ARIMA(indexedDataset_logScale, order=(2, 1, 2))
    ARIMA_results = ARIMA_model.fit(disp=-1)
    plt.plot(datasetLogDiffShifting)
    plt.plot(ARIMA_results.fittedvalues, color='red')
    plt.title('RSS: %.4f' % sum((ARIMA_results.fittedvalues - datasetLogDiffShifting['#Passengers'])**2))
Great! The plot shows that ARIMA fits our data more closely than the AR-only and MA-only models. Also, observe how the RSS has dropped to 1.0292, from 1.5023 and 1.4721.
SARIMAX
Designed and developed as an extension to ARIMA, SARIMAX, or Seasonal Auto-Regressive Integrated Moving Average with eXogenous factors, is better suited than ARIMA to highly seasonal time series. There are 4 seasonal components that SARIMAX takes into account.
They are:
1. Seasonal Autoregressive Component
2. Seasonal Moving Average Component
3. Seasonal Integration (Differencing) Order Component
4. Seasonal Periodicity
[Figure: the SARIMAX model formula. Source: Wikipedia]
If you are more of a theory conscious person like me, do read more on this here, as getting into the details of the formula is beyond the scope of this article!
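Without going into the full formula, the seasonal integration component and the periodicity can be illustrated on their own. The sketch below assumes monthly data with a yearly cycle (a period of 12) and uses a made-up series; it shows seasonal differencing only, not a SARIMAX fit.

```python
import math

season = 12  # assumed period: monthly data with a yearly cycle

# Hypothetical monthly series: a linear trend plus a repeating 12-month pattern.
series = [50 + 2 * t + 20 * math.sin(2 * math.pi * t / season) for t in range(48)]

# Seasonal differencing of order 1 with period 12:
# compare each month with the same month one year earlier.
seasonal_diff = [series[t] - series[t - season] for t in range(season, len(series))]

print([round(v, 6) for v in seasonal_diff[:6]])
```

The repeating pattern cancels out and only the year-over-year growth remains (2 per month times 12 months). In the `seasonal_order` tuple that SARIMAX accepts, the four entries correspond to the four components listed above, in the order seasonal AR, seasonal differencing, seasonal MA, and periodicity.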
Now, let’s see how well SARIMAX performs on seasonal time-series data like the International-Airline-Passengers data.
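    from statsmodels.tsa.statespace.sarimax import SARIMAX

    # train and test are assumed to be chronological splits of the passenger data
    # (the split itself is not shown in this article)
    SARIMAX_model = SARIMAX(train['#Passengers'], order=(1, 1, 1), seasonal_order=(1, 0, 0, 12))
    SARIMAX_results = SARIMAX_model.fit()

    # assumption: forecast over the test period (start and end are not defined in the original)
    start, end = len(train), len(train) + len(test) - 1
    # the original also passed typ='levels', an argument of the legacy ARIMA API;
    # SARIMAX predictions are already on the original scale
    preds = SARIMAX_results.predict(start, end).rename('SARIMAX Predictions')
    test['#Passengers'].plot(legend=True, figsize=(8, 5))
    preds.plot(legend=True)
[Figure: SARIMAX predictions plotted against the actual test values]
Look how beautifully SARIMAX handles seasonal time series!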
Heading to DL Methods for Complex Time-Series Forecasting
A very common feature of time-series data is long-term dependency: the future is forecast from previous records, which may lie far in the past. Traditional models like ARMA, ARIMA, or SARIMAX struggle to capture such long-term dependencies, which makes them a poor fit for strongly sequence-dependent time series problems.
To address this issue, a neural network architecture designed to handle sequence dependence was proposed. It is known as the Recurrent Neural Network, or RNN.
[Figure: illustration of a Recurrent Neural Network. Source: Medium]
RNNs were designed to work on sequential data like time series. However, a notable pitfall of the RNN is that it cannot handle long-term dependencies. For a problem where you want to forecast a time series from a huge number of previous records, an RNN forgets most of the records that occurred much earlier and learns mainly from the recent data fed to it. So RNNs were observed to fall short on next-sequence prediction tasks in NLP and time series.
To address this issue, a powerful variant of the RNN was developed, known as the LSTM (Long Short-Term Memory) network. Unlike a plain RNN, which captures only short-term dependencies, an LSTM, as its name suggests, was observed to learn long-term as well as short-term dependencies. Hence, it has been a great success for modelling and forecasting time series data!
Note: Since explaining the architecture of LSTM is beyond the scope of this blog, I recommend you head over to my article where I explain LSTM in detail!
Let us now take our Airline Passengers’ data and see how well RNN and LSTM work on it!
Imports —
    import numpy as np
    import pandas as pd
    import tensorflow as tf
    import matplotlib.pyplot as plt
    import sklearn.preprocessing
    from sklearn.metrics import r2_score
    from keras.layers import Dense, Dropout, SimpleRNN, LSTM
    from keras.models import Sequential
Scaling the data to the range 0 to 1 so that the networks train more easily (scaling does not make the data stationary):
    minmax_scaler = sklearn.preprocessing.MinMaxScaler()
    data['Passengers'] = minmax_scaler.fit_transform(data['Passengers'].values.reshape(-1, 1))
    data.head()
[Output: the first rows of the scaled data]
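The split below slices two arrays, x and y, but the step that builds them is not shown in the article. A minimal sketch of one common way to build them, assuming a look-back window of 20 time steps (inferred from the later reshape to 20 steps per sample) and one-step-ahead targets. The helper `make_windows` is hypothetical, and the stand-in list only mimics the length of the airline dataset (144 monthly values).

```python
def make_windows(series, window):
    """Pair each run of `window` consecutive values with the value that follows it."""
    x, y = [], []
    for i in range(len(series) - window):
        x.append(series[i:i + window])  # inputs: `window` consecutive values
        y.append(series[i + window])    # target: the next value
    return x, y

# Stand-in for the scaled passenger column; in the article's setting this
# would be the scaled values of data['Passengers'].
scaled = [i / 143 for i in range(144)]

x, y = make_windows(scaled, window=20)
print(len(x), len(x[0]), len(y))
```

Each sample holds 20 consecutive values and its target is the value right after them, so a series of 144 points yields 124 samples.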
Train, test splits (80–20 ratio) —
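    # x and y are assumed to hold the windowed inputs (20 time steps per sample) and their targets
    split = int(len(data['Passengers']) * 0.8)
    x_train, y_train = np.array(x[:split]), np.array(y[:split])
    x_test, y_test = np.array(x[split:]), np.array(y[split:])

    # reshape to (samples, time steps, features), the input shape the recurrent layers expect
    x_train = np.reshape(x_train, (x_train.shape[0], 20, 1))
    x_test = np.reshape(x_test, (x_test.shape[0], 20, 1))
RNN Model: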
    model = Sequential()
    model.add(SimpleRNN(40, activation="tanh", return_sequences=True, input_shape=(x_train.shape[1], 1)))
    model.add(Dropout(0.15))
    model.add(SimpleRNN(50, return_sequences=True, activation="tanh"))
    model.add(Dropout(0.1))  # dropout to reduce overfitting
    model.add(SimpleRNN(10, activation="tanh"))
    model.add(Dense(1))
    model.summary()
Compile it, fit it and predict:
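    # the original omits the compile step; the optimizer and loss here are assumptions
    model.compile(optimizer="adam", loss="mean_squared_error")
    model.fit(x_train, y_train, epochs=15, batch_size=50)
    preds = model.predict(x_test)
Let me show you a picture of how well the model predicts:
[Figure: RNN predictions plotted against the actual test values]
Pretty much accurate!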
LSTM Model —
    model = Sequential()
    # Keras activation names are lowercase, so "relu" replaces the original "ReLU"
    model.add(LSTM(100, activation="relu", return_sequences=True, input_shape=(x_train.shape[1], 1)))
    model.add(Dropout(0.2))
    model.add(LSTM(80, activation="relu", return_sequences=True))
    model.add(Dropout(0.2))
    model.add(LSTM(50, activation="relu", return_sequences=True))
    model.add(Dropout(0.2))
    model.add(LSTM(30, activation="relu"))
    model.add(Dense(1))
    model.summary()
Compile it, fit it and predict:
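    # the original omits the compile step; the optimizer and loss here are assumptions
    model.compile(optimizer="adam", loss="mean_squared_error")
    model.fit(x_train, y_train, epochs=15, batch_size=50)
    preds = model.predict(x_test)
Let me show you a picture of how well the model predicts:
[Figure: LSTM predictions plotted against the actual test values]
Here, we can observe that the RNN does the job better than the LSTM. The LSTM works well on the training data but poorly on the validation/test data, which is a sign of overfitting!
Hence, on a small dataset like this one, try to use an LSTM only where there is a need for long-term dependency learning; otherwise, a simple RNN works well enough.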
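The comparison above is a visual one, and a numeric score makes it easier to check. Below is a minimal plain-Python sketch of two common metrics, RMSE and R². The numbers are made up and only stand in for y_test and the two models' predictions; they say nothing about the actual models in this article. The helper names `rmse` and `r2` are hypothetical.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: lower is better."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def r2(actual, predicted):
    """Coefficient of determination: 1.0 is a perfect fit, below 0 is worse than predicting the mean."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Made-up numbers standing in for y_test and each model's predictions.
y_true = [0.60, 0.65, 0.70, 0.80, 0.75]
rnn_preds = [0.58, 0.66, 0.72, 0.77, 0.76]
lstm_preds = [0.50, 0.55, 0.85, 0.90, 0.60]

for name, preds in [("RNN", rnn_preds), ("LSTM", lstm_preds)]:
    print(f"{name}: RMSE={rmse(y_true, preds):.3f}, R2={r2(y_true, preds):.3f}")
```

The `r2_score` function imported from scikit-learn earlier computes the same R² quantity. To report errors in passenger counts rather than scaled units, the predictions can first be passed through the scaler's inverse_transform().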
Conclusion
Cheers on reaching the end of the guide and learning some pretty interesting things about Time Series. In this guide, you learned the basics of time series, got a brief idea of the difference between the Time Series Analysis and Forecasting subdomains, built a crisp mathematical intuition for time series analysis and forecasting techniques, and explored how to work on time series problems with Machine Learning and Deep Learning.
Hope you had fun exploring Time Series with Machine Learning and Deep Learning along with intuition! If you are a curious learner and do not want to stop here, head over to this awesome notebook on time series provided by TensorFlow!
Feel free to follow me on Medium and GitHub for more articles and notebooks on Machine & Deep Learning! Connect with me on LinkedIn if you want to discuss anything regarding this article!
Happy Learning!
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.