Datadog & DevOps Level 2 Technical Questions/terminology
What are logs and why do we need to monitor them?
A log is a text file where applications and operating systems write events, making it easier for engineers to gain insight and identify the root cause of an issue. DD log management not only unifies logs, metrics, and traces in a single view, it also lets you troubleshoot, optimize performance, and investigate security threats. Log management tools (DD included) let you make use of all these logs, so teams can search, filter, and analyze logs instead of digging through a folder of text files looking for a needle in a haystack.
Do you support serverless?
AWS Lambda: yes. We pull data through the AWS crawler integration and offer tracing libraries, or can connect to AWS X-Ray. All of this can be found in our serverless view. Azure: yes, we support Azure App Services, which is in BETA in the serverless tab. Azure Functions info can be pulled from the integration. Google Cloud Functions: we can pull insight from our integration (metrics/logs) but do not support Google functions in our serverless tab or offer APM for them.
A team has a containerized infrastructure and its containers shift from host to host. How does Datadog make sure the team can automatically identify the services running on a specific container?
Agent Autodiscovery allows teams to do just this. Whenever a container starts, the Datadog Agent identifies which services are running on the new container, looks for the corresponding monitoring config, and starts collecting metrics. Autodiscovery lets you define configuration templates for Agent checks and specify which containers each check should apply to. You do need to turn this on within the datadog.yaml.
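To make the template idea concrete, here is a minimal sketch of the matching step. It is a simulation, not the Agent's actual code: the `CheckTemplate` and `Container` classes and the `resolve_checks` helper are hypothetical names for illustration. The `ad_identifiers` key and the `%%host%%` template variable mirror names used in Autodiscovery config templates.

```python
from dataclasses import dataclass


@dataclass
class CheckTemplate:
    check_name: str        # e.g. "redisdb"
    ad_identifiers: list   # container image short names this template applies to
    instance: dict         # check config; may contain %%host%% placeholders


@dataclass
class Container:
    name: str
    image: str             # e.g. "registry.example.com/library/redis:7.2"
    ip: str


def image_short_name(image: str) -> str:
    """Strip registry path, tag and digest: '.../redis:7.2' -> 'redis'."""
    last = image.rsplit("/", 1)[-1]
    return last.split(":", 1)[0].split("@", 1)[0]


def resolve_checks(container: Container, templates: list) -> list:
    """Return (check_name, instance) pairs for templates matching the container."""
    short = image_short_name(container.image)
    resolved = []
    for template in templates:
        if short not in template.ad_identifiers:
            continue
        # Fill in the container's address wherever the template asks for it.
        instance = {
            key: value.replace("%%host%%", container.ip) if isinstance(value, str) else value
            for key, value in template.instance.items()
        }
        resolved.append((template.check_name, instance))
    return resolved


templates = [CheckTemplate("redisdb", ["redis"], {"host": "%%host%%", "port": 6379})]
cache = Container("cache-1", "registry.example.com/library/redis:7.2", "10.0.0.5")
print(resolve_checks(cache, templates))
```

The point to take away: the template is written once, and it follows the container wherever it lands because matching is done on the image name, not on the host.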
ECS Fargate vs Lambda
Both are proprietary to AWS and yes, both are considered serverless offerings. Fargate and Lambda are fundamentally different as serverless solutions and are priced differently in AWS & Datadog. AWS Fargate: run Docker containers without managing any servers for container orchestration, vs. running an EC2 instance and managing those containers yourself. You may also hear these referred to as "ECS tasks". AWS Lambda: a service that allows developers to run code in AWS without provisioning or managing servers. In the simplest use cases, developers upload their code, specify the amount of memory for a function, and it runs when invoked (hence why we price per invocation). You can think of Lambda functions as ephemeral: the code only runs when invoked and does not consume resources when it is not needed.
Logging without limits
DD coined the term "Logging without Limits" (LWL); the software is not proprietary, but the name is. LWL allows teams to decouple the logs being ingested from the logs that are indexed. The purpose is to pay only for what you need. In addition to LWL we offer multiple retention filters, log rehydration, log caps, log patterns, and the list goes on.
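The decoupling can be sketched in a few lines. This is only an illustration of the idea, not how the product is implemented: `split_ingested_and_indexed` and the sample log records are hypothetical.

```python
def split_ingested_and_indexed(logs, should_index):
    """Ingest every log, but index only the ones the filter keeps."""
    ingested = []
    indexed = []
    for log in logs:
        ingested.append(log)      # everything is ingested
        if should_index(log):     # only a chosen subset is indexed
            indexed.append(log)
    return ingested, indexed


logs = [
    {"status": "debug", "msg": "cache warm"},
    {"status": "error", "msg": "payment failed"},
    {"status": "info", "msg": "user login"},
]

# Example filter (an assumption): skip indexing for debug logs.
ingested, indexed = split_ingested_and_indexed(logs, lambda log: log["status"] != "debug")
print(len(ingested), "ingested,", len(indexed), "indexed")
```

In a customer conversation the filter is the part worth dwelling on, since it is the lever that controls how much of the ingested volume ends up indexed.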
What is "Live containers"?
DD live containers enables real-time visibility into all containers across your environment.
Sensitive Data Scanner
DD Sensitive Data Scanner helps businesses meet compliance goals by discovering, classifying and hiding sensitive information within your log data in real time. Datadog scans your logs for patterns of sensitive data upon ingestion and then hashes or redacts it following built-in or user-defined rules, to help businesses stay compliant with SOC 2, GDPR, HIPAA, CCPA, and more. Priced at $0.30 per GB. You can also scrub your logs manually, for example for credit card numbers (no charge here).
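For a feel of what "hash or redact on a pattern" means, here is a rough sketch using only the Python standard library. The regular expression, the `scrub` function and the `[REDACTED]` placeholder are assumptions made for illustration; they are not the product's built-in rules.

```python
import hashlib
import re

# Hypothetical rule: 16 digits, optionally separated by spaces or dashes.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){15}\d\b")


def scrub(message: str, mode: str = "redact") -> str:
    """Replace card-like numbers with a placeholder or with their SHA-256 hash."""
    def _replace(match):
        if mode == "hash":
            return hashlib.sha256(match.group(0).encode("utf-8")).hexdigest()
        return "[REDACTED]"

    return CARD_PATTERN.sub(_replace, message)


print(scrub("card=4111 1111 1111 1111 ok"))
print(scrub("card=4111 1111 1111 1111 ok", mode="hash"))
```

Redaction removes the value entirely, while hashing keeps a stable token so that the same value can still be correlated across log lines without exposing it.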
CI Visibility
Datadog CI Visibility brings together information about CI test and pipeline results plus data about CI performance, trends and reliability in one place. **Tracing tests and tracing pipelines are priced separately. **To instrument tests, install the Agent on the CI worker node, or forward from environment variables for tests in containers. Once this is done, instrument the language. **To trace pipelines we support Buildkite, CircleCI, GitHub, GitLab, Jenkins & custom commands. This is not agent based; it is done through an API key.
Session Replay
Datadog's Session Replay allows you to capture and visually replay the web browsing experience of your users. Combined with RUM performance data, Session Replay is beneficial for error identification, reproduction, and resolution, and provides insights into your web application's usage patterns and design pitfalls.
Can Datadog replace PagerDuty/offer incident management?
It depends on how the team is using PagerDuty today, but in some cases, yes. For some teams this may simply add value, given that incidents created in DD can be sent directly to PagerDuty, can create a Jira ticket, or can trigger a webhook that sends incident notifications (for example an SMS). DD's incident management offers teams a set framework for handling an incident, which can be declared from a graph, the clipboard, Slack, a monitor or the incidents page.
Docker vs Kubernetes
Docker is an open source project that automates the deployment of applications inside software containers. Typically when teams move to microservices they implement containers. Docker containers run on a single node (node = host/EC2/VM). Kubernetes, while not the only container orchestration tool, is an open-source system for automating container deployment, scaling and management across multiple nodes. Alternative options to Kubernetes: ECS, Docker Swarm, OpenShift, Rancher, Fargate. AWS, Azure and Google also have their own managed Kubernetes services (EKS, AKS, GKE), and yes, we support all of these.
What is error tracking?
Have you heard customers reference any of the following: * Raygun * Sentry * Bugsnag * Rollbar * Airbrake. Yep, we can consolidate one more tool. Datadog offers error tracking, allowing you to group similar errors into issues, follow issues over time and get additional context in one place. We offer Error Tracking across web, mobile and backend applications. As a result it is coupled with APM & RUM pricing, depending on whether the team wants to track backend, front end or both.
What are custom metrics?
If a metric is not submitted from one of the more than 450 Datadog integrations, it's considered a custom metric. A custom metric is uniquely identified by a combination of a metric name and tag values. You get an allotment of free custom metrics based on your tier; additional custom metrics can be purchased in 100-metric bundles.
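Because each unique combination counts separately, tag cardinality drives the custom metric count. A quick sketch of the counting rule described above, with hypothetical metric names and tags (`count_custom_metrics` is an illustrative helper, not a Datadog function):

```python
def count_custom_metrics(points):
    """Count unique (metric name, tag value set) combinations."""
    unique = {(name, frozenset(tags)) for name, tags in points}
    return len(unique)


points = [
    ("checkout.latency", ["env:prod", "region:us"]),
    ("checkout.latency", ["region:us", "env:prod"]),  # same tags, different order
    ("checkout.latency", ["env:prod", "region:eu"]),  # new tag value, new metric
    ("checkout.errors", ["env:prod"]),
]
print(count_custom_metrics(points))  # 3
```

One metric name tagged with a high-cardinality value such as a user ID can therefore turn into thousands of custom metrics, which is usually where surprise bills come from.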
Cloud functions, cloud run, Google app engine
If you hear any of these, think Google. Google App Engine (GAE) is supported in DD through an integration (aka free). We can deploy an Agent here, but there are likely cost limitations that prohibit us from doing so: there is no formal billing model and it is challenging to monitor beyond basic metrics. Cloud Functions: a true functions offering comparable to Lambda. We have an integration to pull metrics and logs, but it is currently not supported in the serverless view. Cloud Run: comparable to Fargate. Supported through an integration to pull metrics and logs, but currently not supported in the same way as Fargate, and there is no specific billing here.
APM vs. Continuous Profiler
In tandem, these provide teams with deep visibility into their applications, web services, queues, and databases to monitor requests, errors and latency. With Datadog you can then seamlessly correlate to browser sessions, logs, code-level profiles, synthetic checks, network, processes, infra metrics, etc. APM is responsible for end-to-end distributed tracing, where teams often use it to retain error and high-latency traces to determine the root cause of an application-layer issue. The Continuous Profiler goes one layer deeper, allowing teams to analyze and compare code performance all the time and in any environment, including production, with little overhead. To do this, Datadog continuously profiles each line of code to identify methods that are inefficient under production load, helping teams optimize resource consumption and save on compute costs.
I see you have integrations with NR, Splunk & SolarWinds. Does this mean I can keep using these tools and see everything in a single pane of glass?
No. You can push alerts from each of these tools to Datadog to show in your event stream, but this comes nowhere near the feature parity of sending everything to DD.
What is the difference between Datadog and Prometheus?
Prometheus is an open source tool made by SoundCloud to collect metrics. Prometheus does not provide OOTB dashboarding, so you need to hook into another open source tool (Grafana) to graph. You can push Prometheus metrics to Datadog (these are billed as custom metrics), BUT Datadog can collect from these same sources through our Agent, through Agent-based integrations, DogStatsD & our API. Datadog can typically be pitched as less manual & more OOTB for teams as they scale.
What is the difference between a service, resource, trace or span...
Service: services are the building blocks of modern microservice architectures; broadly, a service groups together endpoints, queries or jobs. Typically people run multiple services per host. We do not bill per service, we bill per host. Resource: represents a particular domain of a customer application, typically an instrumented web endpoint, database query or background job. Trace: a trace is used to track the time spent by an application processing a request and the status of this request. Each trace consists of one or more spans. Span: a span represents a logical unit of work in a distributed system for a given time period. Multiple spans construct a trace.
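How the four terms relate can be shown with a toy data model. This is a sketch for explanation only; the `Span` and `Trace` classes, their fields and the sample values are hypothetical and do not reflect the tracing libraries' real types.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    span_id: int
    parent_id: Optional[int]  # None for the root span
    service: str              # e.g. "web-store"
    resource: str             # e.g. an endpoint or a query
    start_ms: float
    duration_ms: float


@dataclass
class Trace:
    trace_id: int
    spans: List[Span]

    def duration_ms(self) -> float:
        """Time from the first span starting to the last span finishing."""
        start = min(s.start_ms for s in self.spans)
        end = max(s.start_ms + s.duration_ms for s in self.spans)
        return end - start

    def services(self) -> set:
        """All services that did work for this request."""
        return {s.service for s in self.spans}


trace = Trace(
    trace_id=1,
    spans=[
        Span(1, None, "web-store", "GET /checkout", 0.0, 120.0),
        Span(2, 1, "auth", "POST /verify", 10.0, 30.0),
        Span(3, 1, "postgres", "SELECT cart", 50.0, 40.0),
    ],
)
print(trace.duration_ms(), sorted(trace.services()))
```

One request produces one trace, each hop produces a span, and each span is labeled with the service that did the work and the resource it worked on.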
What are tags?
Tags are a way of adding dimensions to Datadog telemetry so it can be filtered, aggregated and compared in DD visualizations. The entire platform is built on tags. Not only do we capture tags from native services, we also have recommendations for unified service tagging.
I want to use APM and not buy infra.
Talk to your manager, but this is not an approved use case of Datadog.
Security concerns with Agent/Datadog
The Datadog Agent submits data to Datadog over a TLS-encrypted TCP connection by default. We provide ways to obfuscate sensitive information within local logs, a secrets management package & Agent security scans. Today Datadog is SOC 2, HIPAA and GDPR compliant & FedRAMP Moderate.
What is a cluster agent and why do I need it?
The Cluster Agent provides a streamlined approach to collecting cluster-level monitoring data. Using the Cluster Agent: * alleviates the impact of Agents on your infra * enables the collection of cluster-level data * allows you to leverage horizontal pod autoscaling with custom Kubernetes metrics.
What is Trace Search and Analytics?
This is an included feature within the DD APM offering that helps customers manage the volume of spans collected by DD. Unlike competitors, we do not blindly sample up front: DD ingests 100% of traces and uses custom or smart filters to sample traces. We offer a live tail that you can search, and indexed traces are stored and searchable for 15 days.
Can I use the agent on IoT devices?
We offer a dedicated IoT Agent optimized for IoT devices and embedded applications.
Synthetics Monitoring
We offer two types of synthetic tests (API tests & browser tests) to help teams observe how their systems and applications are performing, using simulated requests and actions from around the globe. AKA catch something before a user does. * API tests monitor the uptime of your API endpoints. * Multistep API tests link several HTTP requests. * Browser tests test key user journeys. All tests can be run from managed locations, or from private locations to monitor internal-facing applications. All tests can be triggered manually, on a schedule or directly from your CI/CD pipelines.
Does Datadog do alerting?
YES. This is a key component of the platform. We offer a wide range of alerts, from basic metric alerts to more advanced ML-based alerts. You can hook in with collaboration tools like Slack & Teams. You can also hook in with tools like PagerDuty, Jira, Opsgenie, etc.
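As a rough picture of what a basic metric alert looks like when defined in code rather than in the UI, here is a sketch that only builds the JSON body. The field names and query syntax are believed to follow the Monitors API (`POST /api/v1/monitor`) but should be checked against the API reference. The `build_cpu_monitor` helper, the threshold and the notification handles are hypothetical.

```python
import json


def build_cpu_monitor(env: str, critical: int, notify_handles: list) -> dict:
    """Build a metric-alert definition; nothing is sent to Datadog here."""
    mentions = " ".join(notify_handles)
    return {
        "name": f"High CPU on {env} hosts",
        "type": "metric alert",
        # Average CPU over the last 5 minutes, evaluated per host.
        "query": f"avg(last_5m):avg:system.cpu.user{{env:{env}}} by {{host}} > {critical}",
        # @-handles in the message route the alert to the connected tools.
        "message": f"CPU is above {critical}% on {{{{host.name}}}}. {mentions}",
        "options": {"thresholds": {"critical": critical}},
    }


monitor = build_cpu_monitor("prod", 90, ["@slack-ops-alerts", "@pagerduty-checkout"])
print(json.dumps(monitor, indent=2))
```

The message is where the Slack, Teams or PagerDuty hook-in shows up: mentioning a handle there is what sends the notification to that tool.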
For the Kubernetes Agent install we use Helm, is this okay?
Yes, you can run the Agent in your Kubernetes cluster as a DaemonSet, or you can deploy it with a Helm chart. We have instructions for both in our documentation.
Do you support ECS?
Yes, ECS (Amazon Elastic Container Service) is a highly scalable container orchestration service that supports Docker containers. With the Datadog Agent you can monitor ECS containers and tasks on every EC2 instance in your cluster. Not to be confused with ECS Fargate :)
Can we use our configuration tools we already have in place?
Yes, Puppet, Ansible, SaltStack & Chef are not only integrations but also ways teams often mass deploy the DD Agent.
Do you support Ubuntu?
Yes, the Agent can be installed on many different platforms, either directly on a host or as a containerized version. Most systems have a one-line install option. We support macOS, Windows, Linux, Ubuntu, Amazon Linux, Debian, CentOS, Red Hat, Docker, Kubernetes, etc.
Do you offer SSO with SAML?
Yes, we currently support Active Directory, Auth0, Okta, SafeNet, NoPassword, Google & Azure.
ECS on AWS fargate, do we support it?
Yes. Slightly different from a pricing and instrumentation standpoint but fully supported.
What is watchdog?
Watchdog is an algorithmic feature for APM and infra metrics that automatically detects potential application and infrastructure issues.
What types of integrations do you offer - are they credential based or agent based?
450+ and all of the above. Some integrations are credential based, like AWS and Slack, while others may require some additional configuration in the DD YAML, such as SQL & Postgres. Integrations often come with OOTB dashboards.
Azure App Services
Azure App Services is considered a PaaS (Platform as a Service) that runs web, mobile, API and business logic across applications and automatically manages the resources required by those apps. People using Azure consider this to be serverless; technically the functions within the app service are the serverless piece. We do now support Azure App Services with our "serverless offering" and metrics will show up in the serverless tab of the platform, however we bill per host, NOT per invocation. Aka we bill at the host rate per Azure App Service, same for APM.
Cloud security posture management
CSPM, a part of our cloud security platform, performs configuration checks across your cloud accounts, hosts and containers. Scanning is continuous and surveys every resource, then auto-populates a report. We also provide benchmarking based on set standards.
Cloud Workload Security
CWS, a part of our cloud security platform, performs deep, in-kernel analysis of workload activity across your hosts & containers to uncover threats at runtime. We currently support Linux; Windows is to come shortly.
What does the DD mobile app do?
Datadog mobile app enables you to view alerts from Datadog on your mobile device. When receiving an alert via slack, e-mail, pagerduty or other pager apps, you'll be able to investigate issues by opening monitor graphs and dashboards on your device.
Datadog Security Monitoring - also known as "Threat Detection"
Datadog Security Monitoring is one product in our cloud security platform, which provides robust threat detection. Security Monitoring runs on top of ingested logs so you can analyze operational and security logs in real time. To do this we have curated OOTB rules with a number of integrations to detect threats quicker. Priced at $0.20 per GB.
RUM
Datadog's RUM gives you end-to-end visibility into the real-time activity and experience of individual users. RUM is designed to solve four types of use cases for web and mobile applications: 1. Performance: Track the performance of web pages, mobile application screens, user actions, network requests, and your front-end code. 2. Error Management: Monitor ongoing bugs and issues and track them over time and versions. 3. Analytics/Usage: Understand who is using your application, monitor individual user journeys and analyze how users interact with your application (most common pages visited, clicks, interactions, and feature usage). 4. Support: Retrieve all of the information related to one user session to troubleshoot an issue (session duration, pages visited, interactions, resources loaded & errors).
Does Datadog offer Root cause analysis tools?
Datadog's entire platform is designed to reduce MTTR & speed up root cause detection. On top of that, we have Watchdog Root Cause Analysis, which can be used to identify causal relationships between different symptoms across your APM and infra.
What are live processes?
Datadog's Live Processes give you real-time visibility into the processes running on your infrastructure. This is a Pro+/Enterprise feature. - View all of your running processes in one place. - Break down resource consumption on your hosts and containers at the process level. - Query for processes running on a specific host, in a specific zone or workload. - Monitor the performance of internal and third-party software you run, using system metrics at two-second granularity.
What does a single agent mean?
It means that we run 1 Agent that captures data from your infrastructure, applications, logs, etc. By default, when you turn the Agent on, infrastructure monitoring is turned on. You can then easily turn on live processes, NPM, APM, security & logs within the YAML by switching each item from false to true.
NPM vs. NDM?
NPM (Network Performance Monitoring) is designed to give you visibility into your network traffic between services, containers, availability zones and any other tag in Datadog. This tool is designed to pinpoint unexpected or latent service dependencies, optimize costly cross-regional or multi-cloud communication, identify outages of cloud provider regions and third-party tools, and troubleshoot client-side and server-side DNS server issues. **We support Linux and Windows as well as containerized environments - not macOS. **Setup is done through our Agent. **DNS monitoring (provides an overview of DNS server performance and flow-level DNS metrics) is included in NPM. NDM (Network Device Monitoring) is designed to give you visibility into your on-premises and virtual network devices such as routers, switches and firewalls. We offer autodiscovery on any network to quickly start collecting metrics like bandwidth utilization & determine if devices are up/down. **Through a single Agent you can then configure the SNMP integration to either monitor individual devices or use device autodiscovery.
Do we have to use the Agent?
No, customers can send data to Datadog by using a local Agent or through our HTTP API. BUT, what are your concerns with using the Agent? 99.9% of the time we need to give them facts and steer them toward the Agent. Common reasons are security or performance, both of which we can address. If they do not use the Agent for hosts, we can collect metrics through DogStatsD or our API, however these will all be considered custom metrics and cost $$$$$$$$$$ more than if they used the Agent.
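For context on what a DogStatsD submission actually is, the sketch below formats a metric in the plain-text datagram shape (`name:value|type|#tags`) and sends it over UDP. The metric names and tags are made up, `format_datagram` and `send_metric` are illustrative helpers, and the target address assumes a DogStatsD listener on the default local port 8125.

```python
import socket


def format_datagram(name, value, metric_type="g", tags=None) -> str:
    """Format one metric as a DogStatsD-style datagram, e.g. 'a.b:1|c|#env:prod'."""
    datagram = f"{name}:{value}|{metric_type}"
    if tags:
        datagram += "|#" + ",".join(tags)
    return datagram


def send_metric(datagram: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget UDP send; assumes a listener at host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram.encode("utf-8"), (host, port))


gauge = format_datagram("checkout.cart_size", 3, "g", ["env:prod", "team:web"])
print(gauge)
send_metric(gauge)
```

Every metric sent this way is application-defined, which is why it lands in the custom metrics bucket rather than under an integration.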
Do I have to build dashboards in the UI?
No, you can do everything programmatically through the Datadog API including sending Data and graphing in JSON.
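A dashboard defined as JSON looks roughly like the sketch below. The field names (`title`, `layout_type`, `widgets`, `definition`, `requests`, `q`) are believed to match the Dashboards API but should be confirmed in the API reference. The `build_dashboard_payload` helper and the example query are assumptions, and the sketch stops at building the body without calling the API.

```python
import json


def build_dashboard_payload(title: str, queries: list) -> dict:
    """Build a dashboard body with one timeseries widget per metric query."""
    widgets = [
        {
            "definition": {
                "type": "timeseries",
                "title": query,
                "requests": [{"q": query}],
            }
        }
        for query in queries
    ]
    # "ordered" is assumed here to be the automatic-layout option.
    return {"title": title, "layout_type": "ordered", "widgets": widgets}


payload = build_dashboard_payload("Checkout overview", ["avg:system.cpu.user{env:prod}"])
print(json.dumps(payload, indent=2))
```

Keeping dashboards as JSON like this is what lets teams store them in version control and recreate them across accounts.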
Do you offer RBAC?
Yes. Roles categorize users and define what account permissions those users have, such as what data they can read or what account assets they can modify. By default, Datadog offers three roles, and you can create custom roles to define a better mapping between your users and their permissions. When a permission is granted to a role, any user associated with that role receives that permission. When users are associated with multiple roles, they receive all the permissions granted to each of their roles. The more roles a user is associated with, the more access they have within a Datadog account. To create custom roles, reach out to our support team.
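The "multiple roles add up" rule is just a set union. A small sketch, where the role names, permission names and the `effective_permissions` helper are all illustrative and not an exact list from the product:

```python
def effective_permissions(user_roles, role_permissions):
    """A user's permissions are the union of the permissions of all their roles."""
    permissions = set()
    for role in user_roles:
        permissions |= role_permissions.get(role, set())
    return permissions


# Hypothetical role-to-permission mapping.
role_permissions = {
    "read_only": {"dashboards_read"},
    "standard": {"dashboards_read", "dashboards_write", "monitors_write"},
    "billing_custom": {"billing_read"},  # example of a custom role
}

print(sorted(effective_permissions(["read_only", "billing_custom"], role_permissions)))
```

Adding a role never removes access, so the safest way to restrict a user is to take roles away rather than to add a narrower one.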
Serverless APM
Supported for Lambda: you can use the X-Ray hook-in for some languages (.NET) or our own tracing libraries, which support distributed tracing for AWS (Python, Node.js, Ruby, Go or Java). APM is included in the per 1M invocation price. Azure App Services: yes, we can support this and bill for it like a standard host. Google serverless: no.
What is the purpose of the host/container map?
The host/container map gives you a big picture of the health of your infra. Not only can you group by an array of tags, you can also zoom in to better understand the health of a given host or container, such as % CPU utilized. You can also color code
I want to use the Agent to collect logs but don't want to pay for infra?
This is allowed. We have docs that explain how to disable the payload.
What are the differences between Timeboards & Screenboards?
Timeboards have automatic layouts and represent a single point in time, either fixed or real-time, across the entire dashboard. They are commonly used for troubleshooting, correlation and general data exploration. Screenboards are dashboards with a free-form layout which can include a variety of objects such as images, graphs, and logs. They are commonly used as status boards or storytelling views that update in real time or represent fixed points in the past.
Performance concerns with Agent
We offer a single Agent for this reason. It is very lightweight, is written in Go, and averages less than 0.12% CPU. Enabling trace & process collection will inevitably increase resource consumption but, again, should not be noticeable.
Differences between Deep DB monitoring and DB monitoring supported through infra/APM
We support all major known databases through our OOTB integrations and collect basic performance-related metrics such as query throughput, query performance, connections, etc. (covered with infra). As mentioned above, when connected, Datadog APM lets you get visibility and follow stack traces through databases. Deep Database Monitoring gets you even further into problematic queries with metrics such as average latency, total execution time, etc. Currently not all DBs are supported - check the docs.
Can we do things programmatically vs. using the UI?
Yes, we are an API-first company. You can use the API to send data to Datadog, build data visualizations and manage your account. If your team is a heavy Terraform user, this may be a great option. The Datadog Terraform provider allows you to interact with the Datadog API through a Terraform configuration. You can manage your Datadog resources such as dashboards, monitors, logs, etc.
Can you share dashboards with people who are not Datadog users?
Yes. You can create a public URL and share live dashboards with anyone outside of the org.
We use Teams, not Slack. Is that okay?
Yes. We support both.