The Conference Program includes Invited Talks, Plenaries, and Mini Tutorials. Each of these sessions is designed to provide guidance from industry leaders, while offering actionable takeaways to immediately apply the knowledge and ideas gained.
A variety of topics are covered at LISA17; icons throughout the Conference Program below mark each session's key subject areas.
You can combine days of training with days of Conference Program content to build the conference that meets your needs. Pick and choose the sessions that best fit your interests—focus on just one topic or mix and match.
LISA17 Program Grid
Download the program in grid format (PDF, updated 10/25/17).
Wednesday, November 1, 2017
7:30 am–5:00 pm
On-Site Registration and Badge Pickup
Market Street Foyer
7:30 am–8:45 am
Continental Breakfast
Grand Ballroom Foyer
8:45 am–9:00 am
Opening Remarks
Grand Ballroom Foyer
LISA17 Co-Chairs Caskey Dickson, Microsoft, and Connie-Lynne Villani, Fastly, Inc.
9:00 am–10:30 am
Opening Plenaries
Security in Automation
Jamesha Fisher, GitHub
Jamesha Fisher has worked in the tech industry for over 10 years, with a keen eye towards security. Currently a Security Operations Engineer at GitHub, they have lent their security expertise throughout their career in Operations and Systems Engineering to other companies including Google and CloudPassage. In their spare time they are a maker of musical things, delicious things, and objects that use binary numbers.
Leigh Honeywell, Technology Fellow, ACLU
Leigh is a Technology Fellow at the ACLU’s Project on Speech, Privacy, and Technology. Prior to the ACLU, she worked at Slack, Salesforce.com, Microsoft, and Symantec. She has co-founded two hackerspaces, and is an advisor to several nonprofits and startups. Leigh has a Bachelor of Science from the University of Toronto, where she majored in Computer Science and Equity Studies.
10:30 am–11:00 am
Break with Refreshments
Grand Ballroom Foyer
11:00 am–12:30 pm
Talks I
Never Events
Matt Provost, Yelp
The NHS is the United Kingdom's National Health Service, established in 1948 to provide free healthcare at the point of service to all 64.6 million UK residents.
In England's National Health Service (NHS), a Never Event is a serious incident that "arise[s] from [the] failure of strong systemic protective barriers which can be defined as successful, reliable and comprehensive safeguards or remedies". The key criteria for defining Never Events are that they are preventable and have the potential to cause serious patient harm or death. All Never Events are reportable and undergo Root Cause Analysis to determine why the failure occurred, to prevent similar incidents from happening again.
Considering that the NHS is a healthcare service where incidents can obviously have serious, life-threatening or life-changing consequences, together with the scale of services provided (the NHS in England deals with over 1 million patients every 36 hours), their list of Never Events is actually quite short (14 events), including such items as “Wrong site surgery”, “Retained foreign object post-procedure”, and “Wrong route administration of medication.”
In our industry, the requirement for these events to be preventable would exclude things like DDoS attacks or security breaches, which are outside of the SRE team's direct control. Of course, steps should be taken to minimise or prevent these types of incidents, the same way that doctors work to prevent patients from dying of cancer. But they don't cause cancer, so a patient dying of it is not a Never Event. However, a nurse administering the wrong type of cancer medication, or cancer medication to the wrong patient, or delivering the medication via the wrong route (intravenous vs spinal etc) can all be Never Events.
If there are insufficient processes in place to prevent such mistakes, then they cannot be Never Events. This system is designed to protect the staff as well as patients, so that they aren't put under pressure to be perfect. There must be procedures in place so that it doesn't come down to an individual to make all of the correct choices on their own.
Never Events are a fundamental part of the safety culture of the NHS, which is a "just culture that rejects blame as a tool." In recent years, modern systems safety concepts such as just culture and blameless postmortems have been introduced to the System Administration/Site Reliability Engineering/DevOps community from other fields (such as healthcare). However, the concept of defining specific Never Events has not been explored in this context, and it can bring benefits similar to those reported by the healthcare community: a reduction in the recurrence of such events.
Many systems engineering organisations already have their own formal or informal guidelines for reportable events. Publishing postmortems (either internally or public facing) is now becoming standard practice in our industry, but not all of these events are Never Events. These incidents should be studied by each organisation after each postmortem to generate a list of failures that should never occur again because safety systems/protective barriers have been put in place to prevent them. Any subsequent occurrence of such an incident is therefore a Never Event.
The goal of implementing the Never Events system is firstly to reduce the number of these serious events, but also to protect staff and to provide a safe working environment. Repeated Never Events indicate that management has not addressed the underlying causes of these incidents, which shifts responsibility away from the front line staff who are operating in (clearly) unsafe conditions or with inadequate safety systems in place to prevent these events.
While each organisation will come up with its own list of Never Events for its specific environment, based on its examination and analysis of previous incidents, some generalisations can be made. Consider “Wrong Site Surgery” from the NHS list, where the wrong part of the body is operated on (left vs right leg, etc.). This is a process failure, where the staff may perform the correct procedure but in the wrong location. Transferred to the systems administration world, this is analogous to running the correct command on the wrong system.
During their careers, most (if not all) system administrators have made certain classes of similar mistakes such as rebooting the wrong server, removing the wrong directory (including the classic "rm -rf /") or executing a SQL DELETE statement without a WHERE clause. We will examine the steps the NHS has taken to prevent this type of "wrong site" incident, along with other Never Events. By learning from other industries we can come up with recommendations for preventing similar mistakes in our field.
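To make the analogy concrete, a systems version of a "wrong site" barrier can be as simple as a pre-flight check that refuses to run a destructive command on an unconfirmed host. The sketch below is illustrative rather than from the talk; the hostname and the stubbed command are hypothetical:

```python
#!/usr/bin/env python3
"""Illustrative sketch, not from the talk: a 'wrong site' barrier that
forces the operator to confirm the host before a destructive command runs.
The hostname and command below are hypothetical."""
import socket
import subprocess
import sys

def confirm_site(expected_host):
    actual = socket.gethostname()
    if actual != expected_host:
        sys.exit("ABORT: running on %r, expected %r" % (actual, expected_host))
    # Like a surgical checklist: retype the site before the knife comes out.
    typed = input("Type the hostname to confirm (%s): " % actual)
    if typed != actual:
        sys.exit("ABORT: confirmation did not match")

if __name__ == "__main__":
    confirm_site("db-standby-01")                 # hypothetical intended host
    subprocess.run(["echo", "pretend-reboot"])    # destructive action, stubbed
```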
Matt Provost, Yelp
Matt Provost is an Engineering (SRE) Manager at Yelp, based in London. Prior to this he was the Systems Manager at Weta Digital in Wellington, New Zealand, where he was responsible for the Top500-ranked supercomputers used to render such films as Avatar and The Hobbit trilogy. Matt has been a system and network administrator for over twenty years. He has a BA from Indiana University, Bloomington.
The Hidden Costs of On-Call: False Alarms
Cody Wilbourn, Parse.ly
On-call teams, postmortems, and the costs of downtime are well-covered DevOps topics. The costs of false alarms in your alerting are not, yet that noise hinders the team's ability to handle true issues effectively. What are these hidden costs, and how do you eliminate false alarms?
While you're at LISA17, how many monitoring emails do you expect to receive? 50? 100? How many of those need someone's intervention? Odds are you won't need to go off into a corner with your laptop to fix something critical for every one of those emails.
Noisy monitoring system defaults and un-tuned alerts barrage us with information that isn't necessary. Those false alerts have a cost, even if it's not directly attributable to payroll. We'll walk through some of these costs, their dollar impacts on companies, and strategies to reduce the false alarms.
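For a sense of scale, a back-of-the-envelope calculation with invented numbers shows how quickly even quickly-dismissed false alerts add up:

```python
# Back-of-the-envelope sketch with made-up numbers, in the spirit of the
# talk: even "harmless" false alarms carry a payroll-visible cost.
false_alerts_per_week = 100          # hypothetical
minutes_to_triage_each = 5           # glance, decide it's noise, dismiss
loaded_cost_per_hour = 120.0         # hypothetical fully loaded rate, USD

hours_per_year = false_alerts_per_week * 52 * minutes_to_triage_each / 60
print(f"{hours_per_year:.0f} engineer-hours/year "
      f"~= ${hours_per_year * loaded_cost_per_hour:,.0f}/year")
# -> 433 engineer-hours/year ~= $52,000/year
```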
Cody Wilbourn, Parse.ly
Cody has been working in various operational roles for almost a decade and has been on call for most of that time. His background is in batch compute systems: he formerly managed storage and compute resources at Intel Austin for Atom processor design, and he now helps provide realtime web analytics for some of the world's top news sites and publishers with Parse.ly. At Parse.ly, Cody reduced pager alerts by 85% and informational notifications by 70%, most of which were false alarms.
Talks II
UX Design and Education for Effective Monitoring Tools
Amy Nguyen, Stripe
The fastest way to become a 10X engineer is by enabling 10 other engineers to do their jobs better. As infrastructure engineers, part of our mission is to empower the rest of our engineering organization to use the tools we develop correctly, quickly, and independently. Yet we often fall short of that mission in unexpected ways. In this talk, I will explain ways to make concepts like interpolation, aggregation, and alerting more intuitive and how to identify pain points for new users. I'll go over common misconceptions users have about monitoring and how you can clear up this confusion with improved training and UI design.
Amy Nguyen, Stripe
Amy Nguyen is a software engineer passionate about making data understandable for everyone. In the past, she studied computer science and philosophy at Stanford University, served on the board of Stanford Women in Computer Science for three years, and helped make computer science the most popular major for female undergraduates during her time there. Outside of work, Amy writes about the tech industry, loves baking, and reads too many self-improvement books.
ChatOps at Shopify: Inviting Bots in Our Day-to-Day Operations
Daniella Niyonkuru, Shopify
ChatOps has already been identified as instrumental for DevOps success. In this talk, I will describe how we use chatbots to accelerate developer onboarding, increase developer productivity, and manage service disruption incidents. ChatOps is about bringing tools into your conversations and using them to interact with the infrastructure. It traditionally combines a chatbot, key plugins, and scripts. I will describe how we integrate these to perform actions against the infrastructure, such as rebalancing traffic, querying the infrastructure state, and various other actions.
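The pattern is easy to sketch. Below is a minimal, hypothetical command-dispatch loop of the kind a chatbot plugin system builds on; the commands and handlers are invented, not Shopify's actual bot:

```python
# Minimal command-dispatch sketch of the ChatOps pattern described above.
from typing import Callable, Dict

HANDLERS: Dict[str, Callable[[str], str]] = {}

def command(name: str):
    """Register a chat command; plugins would call this to add verbs."""
    def register(fn: Callable[[str], str]):
        HANDLERS[name] = fn
        return fn
    return register

@command("status")
def status(args: str) -> str:
    return f"all shards healthy (checked: {args or 'all'})"   # stubbed check

@command("rebalance")
def rebalance(args: str) -> str:
    return f"rebalancing traffic away from {args}"            # stubbed action

def on_message(text: str) -> str:
    verb, _, rest = text.partition(" ")
    handler = HANDLERS.get(verb)
    return handler(rest) if handler else f"unknown command: {verb}"

print(on_message("rebalance pod-7"))
```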
Daniella Niyonkuru, Shopify
Daniella Niyonkuru is a Production Engineer at Shopify where she helps build a better, faster and more resilient platform. Previously, Daniella worked as an Aircraft System Software Specialist, and researched Formal Model Driven Development for Embedded Systems.
Mini Tutorials I
Automating System Data Analysis Using R
Robert Ballance, Independent Computer Scientist
Data analysis is not just about discovery, it’s about communication. The R programming language and ecosystem constitute a rich tool set for automating the reporting process with reproducible and repeatable results. This 90-minute mini-tutorial will illustrate how the R data analysis pipeline can be applied to generating and delivering reports via documents, with a quick look at related techniques for the Web. The presentation will focus on the essence: automating the process of getting tables and graphics into the hands of users. Topics will include: accessing data stored in files and databases; scripting R to automate tasks; using document generation interfaces to generate reports; and applying R packages such as `brew`, `xtable`, and `ggplot2` to make the process easy and supportable.
This mini-tutorial will:
- motivate you to pick up R
- illustrate ways to simplify your life by automating data analysis and reporting
- help you to communicate effectively with users and management using R as a platform
- facilitate the creation of automated analyses so that you and your staff can focus on the hard problems
Robert Ballance, Independent Computer Scientist
Dr. Robert Ballance recently completed a White House Presidential Innovation Fellowship where he applied his skills with R to analyzing and delivering broadband deployment data to communities across the U.S.A. He first developed his R-programming skills while managing large-scale High-Performance Computing systems for Sandia National Laboratories. While at Sandia, he developed several R packages used internally for system analysis and reporting. Prior to joining Sandia in 2003, Dr. Ballance managed systems at the University of New Mexico High Performance Computing Center. He has consulted, taught, and developed software, including R packages, Perl applications, C and C++ compilers, programming tools, Internet software, and Unix device drivers. He is a member of USENIX, the ACM, the IEEE Computer Society, the Internet Society, and the American Association for the Advancement of Science. He was a co-founder of the Linux Clusters Institute and recently served as Secretary of the Cray Users Group. Bob received his Ph.D. in Computer Science from U.C. Berkeley in 1989.
Mini Tutorials II
How to Get Out of Your Own Way
Jessica Hilt and Allison Flick, UC San Diego
This is a class about how to get better at facilitating your own career success. Sysadmins and technologists are typically heads-down-work-hard types who do good work but struggle when it comes to playing the business game of advancement and recognition for their work. This class will break down some of the more cryptic areas of professional development into simple strategies that participants can implement immediately but, more importantly, won’t feel weird doing.
Jessica Hilt, UC San Diego
I’ve worked as an extroverted, career-savvy technologist for more than a decade in areas ranging from politics to start-ups. At the University of California, San Diego, I teach soft skills, community building, and storytelling to our technical community to increase adoption of new technologies, create consensus, and help people understand each other. I run a Women in Advanced Computing group that helps women with career development and advancement. I also create, develop, and customize educational programs that suit the needs of the technologists at UC San Diego, including our own 500-person technical conference. I have been an invited speaker at several technical conferences and IT organizations including the City of San Diego, Intuit, USENIX LISA, and the UC-wide technical conference, UCCSC.
Allison Flick, UC San Diego
Allison Flick is an IT professional who has spent her career within higher education. After spending 10 years as a programmer and database designer, she has shifted her focus to supervision, systems administration, and user support. She is a mentor to many student employees and sysadmins in their early careers. In her spare time, she organizes and runs a Women in Advanced Computing group at UC San Diego that helps women with career development and advancement.
12:30 pm–2:00 pm
Lunch at the Expo
Pacific Concourse
2:00 pm–3:30 pm
Talks I
SREBot—More Than a Chatbot—An Intelligent Bot to Crush Mitigation Time
Cezar Alevatto Guimaraes, Microsoft
SREBot is a knowledgeable and intelligent engine that replaces tribal knowledge and automates incident management activities. It is also extensible, allowing other teams to add their own knowledge. In this talk you will hear how SREBot is being developed and used to reduce the Time to Mitigate (TTM) for Microsoft incidents. We will explain how it was designed and then share the main issues we are facing.
Cezar Alevatto Guimaraes, Microsoft
Cezar Guimaraes is a Site Reliability Engineer Lead on the Microsoft Azure team. He has more than 15 years of experience and has worked at Microsoft for 11 years as a Software Engineer. Currently, he is working on Azure to identify and resolve problems that stand in the way of service uptime through engineering solutions such as bots and intelligence/correlation engines.
Testing Before You Scale & Making Friends While You Do It
Renee Lung, PagerDuty
Your customers shouldn’t find problems before you do. When we develop software and make architectural decisions, we try to anticipate potential problems—ambiguous user interfaces, performance bottlenecks, and other edge cases. Generally we do a good job of it, but as system complexity grows, the mental models we use to plan and understand those structures don’t always adequately accommodate those complexities. So what do we do about this? We can test all the things! By using automation, we test complex scaling scenarios to validate our mental models and to identify unanticipated side-effects.
One of the issues we recently dealt with was supporting a major change in our traffic patterns. Although overall load stayed the same, the stress points produced by that load changed significantly. Major shifts like these always have the potential to disrupt our service, and in turn, disrupt our customers’ ability to keep their systems running. We had some predictions about how our system would react to the new load profile, but we wanted to validate those predictions ourselves rather than waiting for our customers to experience service degradation.
Although each engineering team had some idea of how these changes would affect the performance of their own services and had work scheduled to address those issues, I wanted to make sure we were all equipped to make informed prioritization and planning decisions. All I had to do was figure out a way to consolidate the efforts of more than 90 engineers into one focused attack on our scaling challenges.
Fortunately, I didn’t have to start from scratch: I could build on existing attitudes of collaboration, ownership, and a culture of reliability which has resulted in a rich toolset for testing resilience and scalability. This talk will outline how we used those tools, developed new ones, what we learned in the process, and the challenges of consolidating the efforts of separate teams towards a specific, common initiative.
Renee Lung, PagerDuty
I’m a full-stack engineer at PagerDuty, and I work on one team in a fairly large engineering department. One of the things I love most about my job is that I get to work on back end services to make sure all the wiring and plumbing is doing its job, but also I get to do some front-end development so I can see my code in action. Working at PagerDuty is my first experience with DevOps, so in addition to learning a lot about the systems that back up my code, I’ve also learned to really appreciate the work my colleagues do and the services they are responsible for. Before discovering how much I love programming, I was a graduate student, a bread baker, and a graphic designer. When I'm not lost in the endless tubes of internet, I'm playing roller derby, cross-stitching, or watching Star Trek with my cats.
Scalability Is Quantifiable: The Universal Scalability Law
Baron Schwartz, VividCortex
Do you know what scalability really is? It's a mathematical function that's simple, precise, and useful. REALLY useful. It describes the relationship between system performance and load. In this talk you'll learn the function (the Universal Scalability Law), how it describes and predicts system behavior you see every day, and how to use it in practice. I'll show you how to understand the function, how to capture the data you need to measure your own system's behavior (you probably already have that), and how to analyze the data with the USL. You'll leave this talk knowing exactly what scalability is and what causes non-linear scaling. There are two factors, and you'll start seeing those everywhere, too. As a result, when systems don't scale you'll know what kind of problem to look for, and you'll avoid building bottlenecks into your systems in the first place. Final note: this talk requires zero mathematical skill.
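For reference, the function the talk refers to is usually written (after Neil Gunther) as relative capacity at load N, where σ captures contention (serialization) and κ captures coherency (crosstalk) costs; these are the two factors the abstract mentions:

```latex
% Universal Scalability Law (after Gunther): relative capacity at load N.
% \sigma = contention (serialization), \kappa = coherency (crosstalk).
C(N) = \frac{N}{1 + \sigma\,(N - 1) + \kappa\,N\,(N - 1)}
```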
Baron Schwartz, VividCortex
Baron is the founder and CEO of VividCortex, the best way to see what your production database servers are doing. He is the author of High Performance MySQL and many open-source tools for MySQL administration. He's also a frequent participant in many database communities, including Postgres, Redis, MongoDB, and more.
Talks II
Working with DBAs in a DevOps World
Silvia Botros, Sr. DBA at Sendgrid
DevOps is about breaking silos. Bringing everyone to the table to bring more value to the company. But how does that fit with specialists on a team like DBAs who, by definition, are a silo of specific knowledge?
Trick question! I don't think DBAs are 'by definition a silo.' I have been a DBA with outdated expectations of my role in the past. And if you would like to know how to promote collaboration with your DBA team, I have some stories to share! In this talk, I will show you how to help your DBA get involved early in your feature planning, and how to draw on their expertise and use their knowledge to turn good performance and operability into v1 features rather than add-ons.
I will draw from my experience as the only DBA in a rapidly growing company that was learning how to DevOps just as I was learning what the word does and doesn't mean. I will give examples of how to grow the relationship between the DBA and the engineering teams to build stronger collaboration. And I will share lessons learned from projects that went well and from some that hit bumps in the road, and why they did.
Silvia Botros, Sr. DBA at Sendgrid
Silvia Botros is a Sr Database engineer at SendGrid, a cloud email provider for household names like Spotify, Pandora, Airbnb and Ebay. In her spare time she is busy with 3 Jr DBAs at home (start them early!).
Queueing Theory in Practice: Performance Modeling for the Working Engineer
Eben Freeman, Honeycomb.io
Cloud! Autoscaling! Kubernetes! Etc! In theory, it's easier than ever to scale a service based on variable demand. In practice, it's still hard to take observed metrics, and translate them into quantitative predictions about what will happen to service performance as load changes. Resource limits are often chosen by guesstimation, and teams are likely to find themselves reacting to slowdowns and bottlenecks, rather than anticipating them. Queueing theory can help, by treating large-scale software systems as mathematical models that you can rigorously reason about. But it's not necessarily easy to translate between real-world systems and textbook models. This talk will cover practical techniques for turning operational data into actionable predictions. We'll show how to use the Universal Scalability Law to develop a model of system performance, and how to leverage that model to make more informed capacity planning and architectural decisions. We'll discuss what data to gather in production to better inform its predictions -- for example, why it's important to capture the shape of a latency distribution, and not just a few percentiles. We'll also talk about some of the limitations and pitfalls of performance modelling.
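As a concrete illustration of the approach (a sketch, not code from the talk), here is what fitting the Universal Scalability Law to measured throughput might look like, assuming NumPy and SciPy are available; the data points are invented:

```python
# Sketch of fitting the Universal Scalability Law to measured throughput.
import numpy as np
from scipy.optimize import curve_fit

def usl(n, lam, sigma, kappa):
    # lam: ideal per-unit throughput; sigma: contention; kappa: coherency
    return lam * n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

load = np.array([1, 2, 4, 8, 16, 32])                  # concurrent clients
tput = np.array([995, 1920, 3560, 5980, 8130, 8020])   # req/s (hypothetical)

(lam, sigma, kappa), _ = curve_fit(usl, load, tput, p0=[1000, 0.01, 0.0001])
peak = np.sqrt((1 - sigma) / kappa)    # load at which throughput peaks
print(f"sigma={sigma:.4f} kappa={kappa:.6f} peak at ~{peak:.0f} clients")
```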
Eben Freeman, Honeycomb.io
Now largely reformed after stints studying theoretical mathematics and climbing rocks, Eben is fascinated by tools that help humans better understand the systems they create. He works as an engineer at Honeycomb.io.
Distributed Tracing: From Theory to Practice
Stella Cotton, Heroku
Application performance monitoring is great for debugging inside a single app. However, as a system expands into multiple services, how can you understand the health of the system as a whole? Distributed tracing can help! You’ll learn the theory behind how distributed tracing works. But we’ll also dive into other practical considerations you won’t get from a README, like choosing libraries for your polyglot systems, infrastructure considerations, and security.
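To make the core mechanism concrete, here is a toy sketch (not from the talk) of trace-context propagation using B3-style headers as popularized by Zipkin; the services are imaginary and the handling is deliberately simplified:

```python
# Toy illustration of the core idea behind distributed tracing: propagate
# a trace ID and parent span ID across service boundaries via headers.
import uuid

def start_span(headers: dict) -> dict:
    # Reuse the caller's trace if present, otherwise start a new one.
    trace_id = headers.get("X-B3-TraceId") or uuid.uuid4().hex
    return {
        "trace_id": trace_id,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": headers.get("X-B3-SpanId"),
    }

def outgoing_headers(span: dict) -> dict:
    # Whatever service we call next will stitch its span under ours.
    return {"X-B3-TraceId": span["trace_id"], "X-B3-SpanId": span["span_id"]}

incoming = {}                       # first hop: no trace yet
span_a = start_span(incoming)
span_b = start_span(outgoing_headers(span_a))
assert span_b["trace_id"] == span_a["trace_id"]
assert span_b["parent_id"] == span_a["span_id"]
```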
Stella Cotton, Heroku
Stella Cotton is a Tools engineer at Heroku and co-founder of AndConf and Fog City Ruby. She loves good abstractions and boring technology.
Mini Tutorials I
Handling Emergency Changes and Urgent Requests
Jeanne Schock, Armstrong Flooring Inc.
IT emergencies and urgent requests are not the same thing, yet they are often both handled the same way. Emergency changes are required to repair or minimize outages of critical services. Urgent requests come from VIPs, rushed project deadlines or human error. They require normal changes to be expedited. Emergencies and urgent requests are unfortunate realities for which we need to be prepared.
Jeanne Schock, Armstrong Flooring Inc.
Jeanne Schock has a background in Linux/FreeBSD/Windows system administration that includes working at a regional ISP, a large video hosting company and a Top Level Domain Registry services and DNS provider. About 7 years ago she transitioned to a role building and managing processes in support of IT operations, disaster recovery, and continual improvement. She is a certified Expert in the IT Infrastructure Library (ITIL) process framework with in-the-trenches experience in Change, Incident, and Problem Management. Jeanne also has a pre-IT academic and teaching career and is an experienced trainer and public presenter.
Mini Tutorials II
S, M, and L Logstash Architectures: The Foundations
Jamie Riedesel, HelloSign
LogStash can scale. From all-in-one boxes (S) to architectures that involve routing log-lines to separate parsing clusters run by different business units (L), LogStash can do it. In this talk, I will be going over the foundations of LogStash architectures; such as the components of LogStash, where it can be deployed, working with ElasticSearch, and an overview of human interfaces to this data.
Jamie Riedesel, HelloSign
Jamie Riedesel is a DevOps Engineer at HelloSign and has been performing acts of systems administration and engineering since 1997, and more dev-like things since 2010. She moved from corporate IT to the startup space in 2010 and experienced the good kind of culture shock. Jamie has been blogging as sysadmin1138 since 2004, a community elected moderator on ServerFault since 2010, and awarded the Chuck Yerkes community award by LOPSA in 2015.
3:30 pm–4:00 pm
Break with Refreshments at the Expo
Pacific Concourse
4:00 pm–5:30 pm
Talks I
The 7 Deadly Sins of Documentation
Chastity Blackwell, Yelp
Documentation, or the lack of it, is often one of the biggest issues with working in tech. In most places, code has supremacy, and documentation ends up being an afterthought. Unfortunately, even in places where documentation is actually written, it’s often done quickly or poorly in the first place, not maintained, or not organized in a way that makes it easy to use. This talk will discuss the biggest problems surrounding creating, maintaining, and providing utility with documentation, and how to solve them.
Chastity Blackwell, Yelp
Chastity Blackwell took her first job as a system administrator in 1999 just to pay the bills until she could get a writing job. After 12 years working in infrastructure operations at the University of Illinois, she decided this might actually be a career, and was lured out to the Bay Area to work for a startup. She survived a yearlong stint as a manager before returning to the front lines as a Site Reliability Engineer at Yelp.
Persistent SRE Antipatterns: Pitfalls on the Road to Creating a Successful SRE Program Like Netflix and Google
Blake Bisset; Jonah Horowitz, Stripe
People aren't just wrong on the internet. Sometimes they bring it back to the office. We're here to debunk the biggest traps we've stepped in, spent good drink money learning about from other people who'd stepped in them, or seen someone who hadn't stepped in them yet propose as good practice. Save yourself some pain. Or just laugh at ours. The talk addresses specific anti-patterns we've seen in building teams and systems to manage service delivery for very large scale operations, and more appropriate ways to approach those issues.
Blake Bisset
Blake Bisset got his first legal tech job at 16. He won’t say how long ago, except that he’s legitimately entitled to make shakeyfists while shouting “Get off my LAN!” He’s done 3 start-ups (a joint venture of Dupont/ConAgra, a biotech spinoff from the U.W., and this other time a bunch of kids were sitting around New Year’s Eve, wondering why they couldn’t watch movies on the Internet), only to end up spending a half-decade as an SRM at YouTube and Chrome, where his happiest accomplishment was holding the go/bestpostmortem link for several years.
Jonah Horowitz, Stripe
Jonah Horowitz is a Site Reliability Engineer with Stripe. He works with all of the individual engineering teams at Stripe to drive reliability efforts, including monitoring, alerting, deployment pipelines, and chaos resiliency. Before coming to Stripe, he worked at several startups around the Bay Area, including Netflix; Quantcast, a leading ad-tech startup where he grew their network to process over 3 million events per second; and Looksmart, a contextual advertising company. He was also on the founding team of Wal-Mart.com (now Walmart Labs), where he built out their software deployment pipelines and their product image management systems.
Talks II
Becoming a Plumber: Building Deployment Pipelines
Daniel Barker, DST Systems
A core part of our IT transformation program is the implementation of deployment pipelines for every application. Attendees will learn how to build abstract pipelines that will allow multiple types of applications to fit the same basic pipeline structure. This has been a big win for injecting change and updating legacy applications.
Daniel Barker, DST Systems
Dan spent 12 years in the military as a mechanic on fighter jets, like the F-16, before transitioning to a career in technology as a Software Engineer, then a DevOps Engineer, and now a Software Development Manager. He’s leading a team of engineers dedicated to bringing DevOps principles and practices to a financial and health services company. This team is core to a multi-million dollar transformation program. Dan is also an organizer of the DevOps KC Meetup and the DevOpsDays KC conference.
Have You Tried Turning It Off and Turning It On Again?
Tanya Reilly, Google
Most of us have a backup strategy and many of us have a restore strategy and several of us have a fully tested restore strategy. But backups are far from the whole story! This talk covers the parts of disaster recovery you might be less prepared for, and the dependencies that you might not think about until one day when you really do turn an entire service, entire site or (perish the thought!) an entire company off and on again.
We'll look at why the best laid fallback plans tend to go wrong, and why you should start deliberately managing your dependencies long before you think you need to. And we'll look at dependency cycles that make it difficult or impossible to restart groups of systems. Like, where do you store the documentation on how to recover the documentation server?
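The documentation-server riddle generalizes: one way to surface these traps before a cold start is to walk the service dependency graph looking for cycles. A toy sketch with an invented service graph:

```python
# Toy sketch (invented example): detect dependency cycles that would block
# a cold start, via depth-first search over a service dependency graph.
DEPS = {
    "docs-server": ["auth", "storage"],
    "auth":        ["storage"],
    "storage":     ["docs-server"],   # oops: where do the recovery docs live?
}

def find_cycle(graph):
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in graph}
    def visit(node, path):
        color[node] = GRAY
        for dep in graph.get(node, []):
            if color.get(dep) == GRAY:
                return path + [dep]          # back edge closes a cycle
            if color.get(dep, BLACK) == WHITE:
                cycle = visit(dep, path + [dep])
                if cycle:
                    return cycle
        color[node] = BLACK
        return None
    for n in graph:
        if color[n] == WHITE:
            cycle = visit(n, [n])
            if cycle:
                return cycle
    return None

print(find_cycle(DEPS))  # ['docs-server', 'auth', 'storage', 'docs-server']
```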
Tanya Reilly, Google
Tanya Reilly has been a Systems Administrator and Site Reliability Engineer at Google since 2005, working on low level infrastructure like distributed locking, load balancing and bootstrapping. Before Google, she was a Systems Administrator at eircom.net, Ireland's largest ISP, and before that she was the entire IT Department for a small software house.
Mini Tutorials I
Enhancing Monitoring with Spatial Data and Maps
Derek Arnold
In the world of operations, monitoring plays a crucial part. In some cases, monitoring based on location is needed. The volume of interconnected devices only seems to be growing, so it may be useful to monitor spatial data to complement any other data one may collect. There are plenty of ways to visualize this data but, as a fan of the medium, my first thought is "what about maps?"
This mini-tutorial provides insight into the ways users can employ spatial data to add geographic awareness to the data they already have. There are many packages and languages that can be utilized for this task. I will use tools such as OpenLayers and spatial extensions to existing databases (e.g., PostGIS, MariaDB, etc.) to:
- Show how data can be stored in a spatially aware manner
- Introduce GIS standards that illuminate ways to organize the spatial components
- Present methods for acquiring spatial data
- Show how to add spatial data to current data
- And of course... demonstrate how to map this data in a clear and concise manner as a complement to other visualization methods
Derek Arnold
Derek Arnold has worked in many different parts of technology as a system administrator, developer, and instructor in the telecommunications, manufacturing, education, and government sectors for the last 20 years.
Mini Tutorials II
S, M, and L Logstash Architectures: Reaching for the Sky
Jamie Riedesel, HelloSign
LogStash can scale. From all-in-one boxes (S) to architectures that involve routing log-lines to separate parsing clusters managed by diverse departments (L), LogStash can do it. If you have the foundations of LogStash down, we can talk about scaling it up. From architectures with syslog as the collector and LogStash purely as a parser, to architectures where LogStash is acting as both collector and parser, you will run into scaling issues as you get bigger. We will go over scaled up and out architectures, and equip you with the knowledge of what XL might look like for you. Also, scale means more than events per second. Scale can also mean maintaining multiple years of certain types of logs. As you scale through time, you will face upgrade problems. Are you still on LogStash 1.5 because 2.x requires ElasticSearch 2.x? Or LogStash 2.4 because 5.x requires ElasticSearch 5.x? We will go over techniques to upgrade your deep history and get your architecture closer to ‘latest’.
Jamie Riedesel, HelloSign
Jamie Riedesel is a DevOps Engineer at HelloSign and has been performing acts of systems administration and engineering since 1997, and more dev-like things since 2010. She moved from corporate IT to the startup space in 2010 and experienced the good kind of culture shock. Jamie has been blogging as sysadmin1138 since 2004, a community elected moderator on ServerFault since 2010, and awarded the Chuck Yerkes community award by LOPSA in 2015.
6:00 pm–7:00 pm
Expo Happy Hour
Pacific Concourse
7:00 pm–11:00 pm
Birds-of-a-Feather Sessions
View the full schedule of BoFs on the LISA17 BoFs page.
Thursday, November 2, 2017
8:00 am–5:00 pm
On-Site Registration and Badge Pickup
Market Street Foyer
8:00 am–9:00 am
Continental Breakfast
Grand Ballroom Foyer
9:00 am–10:30 am
Talks I
Disaggregating the Network: Switching as a Service
Nina Schiff, Facebook
At Facebook, we’ve traditionally focused on disaggregation through most of our systems. This has helped us to iterate faster, harden where needed, and scale out our bottlenecks more easily. However, in the network, we have had very little control over the switching ecosystem, making us reliant on the timelines of other companies. Adaptability and customization are not typically what comes to mind when people think about network switches. Hardware is often proprietary, and if you're buying a vendor switch, you don't control the frequency or speed of new features or bug fixes. These constraints are inconvenient at best, particularly for large production environments. This led us to try something different: disaggregating the hardware components and the software workflow into Wedge and FBOSS, respectively. We also moved to make our switches look significantly more like traditional servers. While this has brought new (and definitely interesting) challenges, it has also meant that we’ve been able to piggyback off advances in server management. This talk takes a look at this composite architecture within our production setting while examining the lessons we learnt along the way. It also highlights how having a server as a switch helps us iterate faster, provides a more reliable network, and meets the scaling demands of Facebook’s ever-increasing traffic growth.
Nina Schiff, Facebook
Nina is a software engineer working on Facebook's disaggregated network switches, making sure the packets go where they should. Before that she spent time working on deployment, containers and the occasional site outage.
LinkedIn's Distributed Firewall
Mike Svoboda, LinkedIn, and Nils Christian Roscher-Nielsen, Zener
Distributed Firewall (DFW) has fundamentally altered LinkedIn's System, Network, and Security Operations. This technology has enabled LinkedIn to expand with unbound horizontal scalability by leveraging Software Defined Networking. Combining system automation with host based firewalls, DFW has not only allowed LinkedIn to alter the physical network design, but it has also increased the security protections that we can now provide in Production environments.
In this presentation, we will share how LinkedIn was able to remove physical and logical network firewall bottlenecks. By shifting network security enforcement down to the per-host level, DFW enables LinkedIn to fully utilize datacenter power, cooling, and space facilities by intermixing heterogeneous environments within the same physical rack and network footprint. Integrating DFW with LinkedIn's code deployment system, the firewall has become aware of the specific application requirements on each node, and can build a unique security profile to secure the hosted services.
We will demonstrate DFW in action, point to the open source code, and share lessons learned from our Production implementation so that other organizations can leverage this technology.
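As a rough sketch of the per-host enforcement idea (invented here for illustration, not LinkedIn's actual DFW code), a host agent might derive its local firewall policy from what the deployment system says is running on the node; the service names, ports, and CIDRs below are hypothetical:

```python
# Hypothetical sketch: derive a host firewall policy from the node's
# deployment manifest, so each host enforces only what it actually runs.
DEPLOYED = [
    {"app": "search-frontend", "port": 8080, "allow_from": ["10.0.0.0/8"]},
    {"app": "metrics-agent",   "port": 9100, "allow_from": ["10.1.2.0/24"]},
]

def iptables_rules(deployed):
    rules = ["-P INPUT DROP"]                    # default deny
    for svc in deployed:
        for cidr in svc["allow_from"]:
            rules.append(
                f"-A INPUT -p tcp --dport {svc['port']} -s {cidr} "
                f"-m comment --comment {svc['app']} -j ACCEPT"
            )
    return rules

print("\n".join(iptables_rules(DEPLOYED)))
```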
Mike Svoboda, LinkedIn
Mike Svoboda is a Senior Staff Engineer, working in Production Operations at LinkedIn for the past seven years. Mike has built or has been involved with most of LinkedIn's configuration management infrastructure using the CFEngine framework.
Talks II
Fast and Safe Production Monitoring of JVM Applications with BPF Magic
Sasha Goldshtein, Sela Group
All of us have seen these evasive performance issues or production bugs in the field, which standard monitoring tools don't see or catch. BPF is a Linux kernel technology that enables fast, safe, dynamic tracing of a running system without any preparation or instrumentation in advance. The JVM itself has a myriad of insertion points for tracing garbage collections, object allocations, JNI calls, and even method calls with extended probes. When the JVM tracepoints don't cut it, the Linux kernel and libraries allow tracing system calls, network packets, scheduler events, off-CPU time, time blocked on disk accesses, and even database queries. In this talk, we will see a holistic set of BPF-based tools for monitoring JVM applications on Linux, and revisit a systems performance checklist that includes classics like fileslower, opensnoop, and strace—all based on the non-invasive, fast, and safe BPF technology.
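For a flavor of what BPF-based tooling looks like, here is a minimal sketch using the BCC Python bindings (assuming the bcc package and a BPF-capable kernel); it traces open() syscalls in the spirit of opensnoop:

```python
# Minimal BCC sketch: dynamically trace open() syscalls with no
# instrumentation prepared in advance. Requires root and the bcc package.
from bcc import BPF

prog = r"""
#include <uapi/linux/ptrace.h>
int trace_open(struct pt_regs *ctx) {
    bpf_trace_printk("open() called\n");
    return 0;
}
"""
b = BPF(text=prog)
b.attach_kprobe(event=b.get_syscall_fnname("open"), fn_name="trace_open")
print("Tracing open() syscalls... Ctrl-C to end")
b.trace_print()
```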
Sasha Goldshtein, Sela Group
Sasha Goldshtein is the CTO of Sela Group, a Microsoft MVP and Regional Director, Pluralsight author, and international consultant and trainer. Sasha is the author of two books and multiple online courses, and a prolific blogger. He is also an active open source contributor to projects focused on system diagnostics, performance monitoring, and tracing -- across multiple operating systems and runtimes. Sasha authored and delivered training courses on Linux performance optimization, event tracing, production debugging, mobile application development, and modern C++. Between his consulting engagements, Sasha speaks at international conferences world-wide.
Charliecloud: Unprivileged Containers for User-Defined Software Stacks in HPC
Michael Jennings, Los Alamos National Laboratory
Supercomputing centers are seeing increasing demand for user-defined software stacks (UDSS), instead of or in addition to the stack provided by the center. These UDSS support user needs such as complex dependencies or build requirements, externally required configurations, portability, and consistency. The challenge for centers is to provide these services in a usable manner while minimizing the risks: security, support burden, missing functionality, and performance. We present Charliecloud, which uses the Linux user and mount namespaces to run industry-standard Docker containers with no privileged operations or daemons on center resources. Our simple approach avoids most security risks while maintaining access to the performance and functionality already on offer, doing so in just 900 lines of code. Charliecloud promises to bring an industry-standard UDSS user workflow to existing, minimally altered HPC resources.
Michael Jennings, Los Alamos National Laboratory
Michael Jennings has been a UNIX/Linux Systems Administrator and a C/Perl developer for over 20 years and has been author of or contributor to numerous open source software projects including Eterm, Mezzanine, RPM, Warewulf, and TORQUE. Additionally, he co-founded the Caos Foundation, creators of CentOS, and has been lead developer on 3 separate Linux distributions. He currently works as a Scientist at Los Alamos National Laboratory and is the primary author/maintainer for the LBNL Node Health Check (NHC) project. He is also the Vice President of HPCXXL, the extreme-scale HPC users group.
Mini Tutorials I
Handling the Interruptive Nature of Operations
Carolyn Rowland, National Institute of Standards and Technology, and Avleen Vig, Facebook
The interrupt-driven nature of Operations can create a high-stress, low-productivity workplace, while providing the illusion of high-productivity. We come up with our own coping mechanisms that sometimes make the situation worse. Working remote creates additional challenges. We will discuss the nature of operations work and how engineers and managers can both work to manage interruptions and create a better environment. This will be an updated version of our LISA 2014 mini-tutorial.
Carolyn Rowland, National Institute of Standards and Technology (NIST)
Carolyn is an IT/Dev manager at NIST. She provides the calming influence to the chaos introduced in this tutorial.
Avleen Vig, Facebook
Avleen is a Production Engineer at Facebook, where he helps scale Facebook’s infrastructure. Before joining Facebook he worked at several large tech companies, including EarthLink, Google, and Etsy.
Mini Tutorials II
WordPress Maintenance and Troubleshooting
Dash Buck
WordPress is a well-established and popular open source Content Management System. Normally taking care of a WordPress website should be the responsibility of a content manager and a front end developer, but if you're IT and you're "it," what do you do? This class will cover WordPress basics as well as maintenance, troubleshooting, and when it's time to call in a (different) professional.
Dash Buck
Dash Buck began developing WordPress websites in early 2014. They give their clients the ability to confidently communicate with developers. Dash lives in Seattle with two adorable cats and one adorable human. When not coding, Dash reads, speaks Esperanto, and yells excitedly about science facts. (Did you know that there's more water in the Earth's mantle than in all surface oceans combined!?) http://emdashbuck.com/pronouns
10:30 am–11:00 am
Break with Refreshments at the Expo
Pacific Concourse
11:00 am–12:30 pm
Talks I
Case Study: Deploying a Multi-Region, Highly Available MySQL Architecture
Gabriel Ciciliani, OSDB Internal Principal Consultant at Pythian
A customer recently asked us to design a multi-region database architecture that allows their application to read by default from a local database instance while writing to a single master region. It would also need an automated way to handle failures on any of the regional database instances by redirecting both read and write traffic to an available region.
In this session we are going to go through the architecture designed to fulfill the above requirements, what technologies were considered and why ProxySQL was chosen.
We will also discuss advantages and limitations of the proposed architecture while sharing a few lessons learned in the process.
Gabriel Ciciliani, OSDB Internal Principal Consultant at Pythian
Gabriel has been dedicated to databases as a DBA and consultant for the last 10 years. He has led and participated in multiple projects across many technologies, including Oracle, MySQL, SQL Server, and MongoDB. Gabriel defines himself as an automation super fan; he is one of the developers of the MySQL/MongoDB DBaaS solution currently in use at MercadoLibre, the largest e-commerce platform in Latin America and top ten worldwide. Gabriel holds a college degree in electronics and a degree in industrial engineering, and he is currently studying for his master's in systems information engineering. He is also an Oracle, AWS, and Microsoft certified professional. Currently he is an Internal Principal Consultant at Pythian specializing in MySQL and MongoDB.
Stories from the Trenches of Government Technology
Matt Cutts, Acting Administrator, US Digital Service, and Raquel Romano, Engineering Lead, Digital Service at Veterans Affairs
Since 2014, the US Digital Service has worked to improve government services from healthcare.gov to small businesses to veterans getting their benefits. Come hear some of the frustrating, surprising, and gratifying stories that we’ve seen as technologists trying to make government work better.
Matt Cutts, Acting Administrator, US Digital Service
Matt worked on Google’s web search for over 16 years, where he created and led Google’s webspam team. In 2016, he joined the US Digital Service and quickly got hooked on its impact. He now lives in DC with his wife and two very lazy cats.
Raquel Romano, Engineering Lead, Digital Service at Veterans Affairs
Raquel spent nearly a decade as a software engineer at Google before becoming obsessed with how government delivers critical services to millions of people. She currently builds software that Veterans and their families rely on for such benefits as health care, education assistance, and disability compensation. She is a remote member of the USDS team at the Veterans Affairs department and lives in the San Francisco Bay Area with her spouse, children, and a highly unusual collection of pets.
Resiliency Testing with Toxiproxy
Jake Pittis, Shopify
Fibers get cut, databases crash, and you’ve adopted Chaos Engineering to challenge your production environment as much as possible. But what are you doing to craft the resiliency test suites that minimize the impact on your application as much as possible? How do you debug resiliency problems locally, and make sure to architect for robustness at development time? Toxiproxy is an open-source tool we’ve used for the past 2 years to emulate timeouts, latency, and outages, and one we believe could benefit nearly every company faced with these issues. In this talk, we’ll dive into practical tips, lessons learned, and best practices so you can use Toxiproxy to write resilient test suites.
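To give a feel for the tool, here is a small sketch that drives Toxiproxy's HTTP API (default port 8474) directly with Python's requests library; the proxy name and addresses are hypothetical, and the official client libraries wrap these same endpoints:

```python
# Sketch: route test traffic through Toxiproxy, then inject latency.
import requests

API = "http://localhost:8474"

# Create a proxy so the app under test reaches redis via toxiproxy.
requests.post(f"{API}/proxies", json={
    "name": "test_redis",
    "listen": "127.0.0.1:26379",
    "upstream": "127.0.0.1:6379",
}).raise_for_status()

# Inject 1s of latency downstream, then assert the app still meets its SLA.
requests.post(f"{API}/proxies/test_redis/toxics", json={
    "type": "latency",
    "attributes": {"latency": 1000, "jitter": 100},
}).raise_for_status()
```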
Jake Pittis, Shopify
In between teaching his team about jazz, Jake can be found on the Production Engineering Team at Shopify. He's worked preparing the platform for massive celebrity sales, making Shopify run out of multiple data centres, and the resiliency stack to protect the app against misbehaving resources, and itself. Canadian Geese are his favourite animals. While the hipster movement of his nation has recently taken to eating these poor birds, Jake has yet to taste one. And never plans to. We don't eat our friends.
Talks II
Operational Compliance: From Requirements to Reality
Trevor Vaughan, VP Engineering - Onyx Point, Inc.
A mere mention of compliance is one of those things that makes most teams throw up their hands in frustration. We would like to share how our Government and Industry customers have successfully approached Compliance Driven Operations and how to use standard development and engineering methodologies to address compliance concerns in a practical manner. Specific techniques and technologies will be mentioned that can help teams approach Compliance as ‘just another set of requirements’ and understand how to communicate effectively with security teams and auditors.
Trevor Vaughan, VP Engineering - Onyx Point, Inc.
One of the co-founders of Onyx Point, Inc., Trevor has been working in various systems administration and automation related fields for over 20 years. Recently, he has been focusing on automated compliance for Federal and commercial systems and helped start the open source SIMP project to provide that capability to the widest audience possible.
Fast Log Analysis Made Easy by Automatically Parsing Heterogeneous Logs
Biplob Debnath and Will Dennis, NEC Laboratories America, Inc.
Existing log analysis tools like ELK (Elasticsearch-LogStash-Kibana), VMware LogInsight, Loggly, etc. provide platforms for indexing, monitoring, and visualizing logs. Although these tools allow users to relatively easily perform ad-hoc queries and define rules in order to generate alerts, they do not provide automated log parsing support. In particular, most of these systems use regular expressions (regex) to parse log messages. These tools assume that the administrators know how to work with regex, and make the admins manually parse and define the fields of interest. By definition, these tools support only supervised parsing as human input is essential. However, human involvement is clearly non-scalable for heterogeneous and continuously evolving log message formats in systems such as IoT, and it is humanly impossible to manually review the sheer number of log entries generated in an hour, let alone days and weeks. On top of that, writing regex-based parsing rules is long, frustrating, error-prone, and regex rules may conflict with each other especially for IoT-like systems. In this talk, we describe how we automatically generate regex rules based on the log data, which is described further in our research work, LogMine: Fast Pattern Recognition for Log Analytics, published at the CIKM 2016 conference. We also show a demo to illustrate how to integrate our solution with the popular ELK stack.
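As a toy illustration of the underlying idea (a drastic simplification, not the LogMine algorithm itself), masking the variable fields of each line makes structurally identical messages collapse into shared templates:

```python
# Toy illustration: mask variable fields so log lines collapse into
# shared templates; real systems learn these patterns from the data.
import re
from collections import Counter

def template(line: str) -> str:
    line = re.sub(r"\b\d{1,3}(\.\d{1,3}){3}\b", "<IP>", line)   # IPv4
    line = re.sub(r"\b0x[0-9a-fA-F]+\b", "<HEX>", line)
    line = re.sub(r"\b\d+\b", "<NUM>", line)
    return line

logs = [
    "conn from 10.0.0.7 port 51234",
    "conn from 10.0.0.9 port 40112",
    "worker 3 restarted after 12 seconds",
]
print(Counter(template(l) for l in logs).most_common())
# [('conn from <IP> port <NUM>', 2),
#  ('worker <NUM> restarted after <NUM> seconds', 1)]
```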
Biplob Debnath, NEC Laboratories America, Inc.
Dr. Biplob Debnath is a researcher at NEC Labs, where his work over the last six years has spanned building end-to-end log analytics solutions, non-volatile memory systems, and systems for data deduplication. His work on log analytics ships in NEC's Log Analysis Service. His PhD research on flash-based key-value stores ships in Bing ObjectStore, his research on data deduplication ships in Windows Server 2012, and his research on caching ships in IBM Storage Array. Biplob received a Ph.D. and an M.S. from the University of Minnesota.
Will Dennis, NEC Laboratories America, Inc.
Will Dennis has been employed at NEC Laboratories America for the last 10 years, currently as a Sr. Systems Administrator in the central Information Technology Services group. In the two decades before his employment with NEC Labs, he held various IT/Operations roles in banking, healthcare and web application development (startup). Will is an avid learner and enjoys working with the many disparate technologies in use in an industrial lab setting.
Your Secrets in Cloud-Based Key Management Services
Dan O'Boyle, Stack Overflow
Do you encrypt secrets before committing them to a repository?
Are API keys and passwords stored in a local library any team member can decrypt?
Are you forced to re-encrypt all secrets anytime access has changed?
Stop doing those things! Cloud-based Key Management Services (Google KMS, Azure Key Vault, Amazon KMS) provide encryption keys as a service, with a centralized access control list. Using a KMS, you can centralize secrets, removing them from local libraries. Key rotation can be automated, often making a KMS more secure than local key management practices.
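A minimal sketch of the workflow with one of the services named above, AWS KMS via boto3; the key alias is hypothetical and the necessary IAM permissions are assumed to exist:

```python
# Sketch: encrypt/decrypt a secret with AWS KMS so the key material
# never leaves the service and access is governed by the key's policy.
import boto3

kms = boto3.client("kms")

resp = kms.encrypt(KeyId="alias/app-secrets", Plaintext=b"db-password")
ciphertext = resp["CiphertextBlob"]   # store this centrally; it is not the key

plaintext = kms.decrypt(CiphertextBlob=ciphertext)["Plaintext"]
assert plaintext == b"db-password"
# Changing who may decrypt is a policy update on the key, so access
# changes never require re-encrypting the secrets themselves.
```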
Dan O'Boyle, Stack Overflow
Dan works as an Internal Support Engineer on the IT team at Stack Overflow. He started his career as a high school teacher and transitioned into a System Administrator. He enjoys creative collaboration to solve solvable things, and using automation for everything else.
Mini Tutorials I
Care & Feeding of a Healthy Job Hunt
VM Brasseur, Freelance Open Source Consultant
Perhaps you're bored at your current job. Perhaps you're new to the tech job market. Perhaps your company lost funding and you were laid off.
There are a million reasons why, but one common element: Sometimes we all have to play the job hunt game. And the game sucks.
From unresponsive recruiters to pointless interview questions, a job hunt can be a demoralizing and dehumanizing process, but there are a lot of things which you can do to make it more productive and less stressful. Some of the things I'll cover in this tutorial:
- Finding the good job postings
- Resume and Cover Letter dos and don'ts
- Organize Organize Organize
- You rule the interview
- Negotiating the offer
- And more!
I've nearly 20 years of experience in tech, most of them on the other side of the table as a hiring manager. I've also spent many years doing career coaching for job seekers. I'll teach you the things which will help get your application noticed in a good way.
VM Brasseur, Freelance Open Source Consultant
In VM (aka Vicky)'s nearly 20 years in the tech industry, she has been an analyst, programmer, product manager, software engineering manager, director of software engineering, and C-level technical executive. She is now a consultant on open source strategy, policy, and processes. Vicky is the winner of the Perl White Camel Award (2014) and the O'Reilly Open Source Award (2016).
Vicky occasionally blogs at anonymoushash.vmbrasseur.com, often writes and is a community moderator for opensource.com, and frequently tweets at @vmbrasseur.
Mini Tutorials II
Writing and Consuming REST Services
Chris St. Pierre, Cisco Systems, Inc.
REST services are widely used for interaction with and between applications and for systems management tasks. This mini-tutorial offers a quick introduction to how REST services are structured, for both the implementer and the client. We will cover the use of HTTP verbs, the architecture of URIs, maintenance of state, middleware, and more.
My presentation from last year can be viewed at https://stpierre.github.io/REST
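As a taste of the structure the tutorial covers, here is a minimal sketch of a hypothetical "hosts" resource in Python with Flask: nouns in the URIs, verbs supplied by HTTP:

```python
# Minimal REST sketch: the resource, routes, and status codes are
# illustrative, not from the tutorial itself.
from flask import Flask, jsonify, request

app = Flask(__name__)
HOSTS = {"web01": {"state": "up"}}

@app.route("/hosts", methods=["GET"])            # collection: list all
def list_hosts():
    return jsonify(HOSTS)

@app.route("/hosts/<name>", methods=["GET"])     # member: fetch one
def get_host(name):
    if name not in HOSTS:
        return jsonify(error="not found"), 404
    return jsonify(HOSTS[name])

@app.route("/hosts/<name>", methods=["PUT"])     # member: create/replace
def put_host(name):
    HOSTS[name] = request.get_json()
    return jsonify(HOSTS[name]), 201

if __name__ == "__main__":
    app.run(port=8080)
```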
Chris St. Pierre, Cisco Systems, Inc.
Chris St. Pierre is currently serving the fourteenth year of a life sentence to hard labor at the command line. He works as a software engineer at Cisco focusing on CI/CD for OpenStack, Docker, and Kubernetes. In his spare time he is a bicycle advocate and civic hacker.
12:30 pm–2:00 pm
Lunch at the Expo
Pacific Concourse
2:00 pm–3:30 pm
Talks I
Now You See Me Too: Visual Tooling for Advanced System Analysis
Suchakrapani Sharma, ShiftLeft Inc.
Command line tools ensure the lowest friction and entry bar for system analysis. However, visual analysis yields more information in a shorter amount of time. As an example, when an application crashes or an elusive transient bug occurs, an understanding of the callstack that led to the anomaly is valuable information. Recording such function call graphs of the application and displaying them on the command line as huge chunks of text has been a common occurrence and a quick resort for such analyses. However, methodical analysis requires better visuals. Modern representations, such as FlameGraphs, FlameCharts, and Sunbursts, have shown how much more effective the same analysis can be when represented visually. However, there are hundreds of techniques to gather trace/debug data, and understanding which visual tool should represent which data can be a daunting task. This talk focuses on the various visual tools available for common system analysis and debugging scenarios. We explore some open source tools used in system tracing and the representation formats for such data coming from multiple sources such as LTTng and eBPF. We explore the historical origins of such visual representations and see the evolution of research ideas into concrete modern tools. We also discuss how, in a few minutes, you can easily enhance the same tools and develop new views to visualize a wide range of data—from network capture and Container/VM tracing to even hardware traces coming directly from CPUs—all in the same tool.
Suchakrapani Sharma, ShiftLeft Inc.
Suchakra is currently a Scientist at ShiftLeft Inc. He completed his PhD in Computer Engineering from École Polytechnique de Montréal, where he worked on eBPF and hardware-assisted tracing techniques for advanced systems performance analysis. He has been involved in research in the performance analysis domain for the last 4 years and has delivered talks on systems analysis at Tracing Summit 2015 (LinuxCon, Seattle), Tracing Summit 2016 (Embedded LinuxCon, Berlin), and FUDCon 2015 (Pune), where he demonstrated advanced kernel and userspace tracing tools in a very "friendly manner". He has developed one of the first hardware-trace-based VM analysis techniques, and wants to see to it that systems analysis tools are ready for the future. He is also a member of the Linux Foundation's IOVisor Project and a contributor to the BPF Compiler Collection. In the past, he has been involved in biomedical and automotive electronics as an embedded Linux engineer. More information about him can be found at https://suchakra.wordpress.com/about/
Vax to K8s: Ticketmaster's Transformation to Cloud Native DevOps
Heather Osborn, Ticketmaster
When you have a 40-year-old company deeply rooted in legacy technologies, the work required to reinvent is dramatic. I will share how we’ve handled this journey so far, our successes and failures, and where we’re going in the future. Moving from a siloed on-premises environment to a DevOps cloud-native company has not been without discomfort. The time-to-market improvements and increased visibility of problems have made us a more agile company with the potential to keep pace with startups.
Heather Osborn, Ticketmaster
Heather Osborn has been working in technology as a system and operations engineer for the last 25 years. Although not common in the tech world, she's stayed with Ticketmaster for the last 19 years through its various incarnations, partly because of multiple technology reinventions and unique challenges, and partly because she wants to see what will happen next. She's looking forward to this new era of public cloud and container orchestration.
Heather is an avid long distance runner who has lots of time to think about these things while pounding the pavement.
Talks II
Linux Container Performance Analysis
Brendan Gregg, Netflix
Containers pose interesting challenges for performance monitoring and analysis, requiring new analysis methodologies and tooling. Resource-oriented analysis, as is common with systems performance tools and GUIs, must now account for both hardware limits and soft limits, as implemented using cgroups. A reverse diagnosis methodology can be applied to identify whether a container is resource constrained, and by which hard or soft resource. The interaction between the host and containers can also be examined, and noisy neighbors identified or exonerated. Performance tooling may require special usage or workarounds to function properly from within a container or on the host, to deal with different privilege levels and namespaces. At Netflix, we're using containers for some microservices, and care very much about analyzing and tuning our containers to be as fast and efficient as possible. This talk will show you how to identify bottlenecks in the host or container configuration, in the applications by profiling in a container environment, and how to dig deeper into kernel and container internals.
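As a flavor of the reverse diagnosis approach, here is a small sketch (our illustration, not Netflix's tooling) that checks whether a cgroup CPU quota is throttling a container by reading the kernel's cpu.stat counters; the cgroup path is an assumption and varies by runtime, distribution, and cgroup version:

import sys

def read_cpu_stat(cgroup_path):
    # cpu.stat (cgroup v1 cpu controller) holds nr_periods, nr_throttled,
    # and throttled_time (nanoseconds) as "key value" lines.
    stats = {}
    with open(cgroup_path + "/cpu.stat") as f:
        for line in f:
            key, value = line.split()
            stats[key] = int(value)
    return stats

if __name__ == "__main__":
    # Default path is an assumption; pass your container's cgroup instead.
    path = sys.argv[1] if len(sys.argv) > 1 else "/sys/fs/cgroup/cpu"
    s = read_cpu_stat(path)
    if s.get("nr_periods"):
        # A rising throttled ratio says the soft (quota) limit, not the
        # hardware, is what is constraining this container.
        ratio = s["nr_throttled"] / s["nr_periods"]
        print("throttled in %.1f%% of periods, %.1fs throttled in total"
              % (ratio * 100, s["throttled_time"] / 1e9))
    else:
        print("no CFS enforcement periods recorded")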
Brendan Gregg, Netflix
Brendan Gregg is an industry expert in computing performance and cloud computing. He is a senior performance architect at Netflix, where he does performance design, evaluation, analysis, and tuning. He is the author of Systems Performance published by Prentice Hall, and received the USENIX LISA Award for Outstanding Achievement in System Administration. Brendan has created performance analysis tools included in multiple operating systems, and visualizations and methodologies for performance analysis, including flame graphs.
The Actor Model and the Queue or “Batch is the New Black”
James Whitehead II, Chief Scientist, Formularity
This presentation will explain how two simple, decades-old computer paradigms can be combined and used to build the world’s largest and most resilient computing solutions. Real world examples will be shown.
In 1974, Carl Hewitt published his paper on the Actor model. In computing, an Actor is a computer program that uses information fed to it through messages to 1) create new Actors, 2) send messages to other Actors, and 3) make limited, often binary, decisions. Just as the binary on-off state of a single transistor can be built into the 2.6 billion(!) transistor Intel i7 microprocessor, Actors can be built into the most complex processing systems. If the Actor model sounds familiar, it’s because it is the basis for Microservices, one of the hottest new topics in cloud computing. Just another example that “…what has been will be again, what has been done will be done again; there is nothing new under the sun.”
The Actor Model is only half of the solution. The key to using Actors to build infinitely scalable real-world systems is how you connect them together. Typically, in Microservices, you send or “push” messages from one Microservice to another. When you reach the throughput limit of a Microservice instance, you clone a few more instances. When you reach the CPU or memory utilization limits of the virtual machine, you fire up more VMs. The key is that you “push” messages. This, however, is the wrong approach. We all know what happens when you push something hard enough—it will fall over. Think of the classic scene from the “I Love Lucy” television program where Lucille Ball is wrapping chocolate candies on a conveyor belt. This graphically demonstrates that the “push” model is the wrong approach.
In Douglas Adams’s “The Hitchhiker’s Guide to the Galaxy,” the quote is “We'll be saying a big hello to all intelligent lifeforms everywhere and to everyone else out there, the secret is to bang the rocks together, guys.” To paraphrase Mr. Adams, the secret to scalable processing systems is really to “pull,” not “push,” messages between Actors.
Rather than send messages directly between Actors, the messages are deposited into queues from which Actors can “pull” messages. As each Actor becomes available, it pulls the next message out of the queue and processes it. This has a number of advantages over “pushing” messages, such as increased Actor process stability, load balancing, predictive monitoring, and transparent redundancy.
Actors are computer programs and as such they aren’t lazy. An Actor will process messages as fast as its execution environment permits. If messages begin to back up in a queue, then you know, long before it becomes critical, that more Actor processes are required. As these new Actor processes become available, there is no need to add them to a load balancer. Each new Actor connects to the same queue and starts asynchronously removing and processing messages. Similarly, when queues become empty, redundant Actors can be terminated. Finally, by using network routing, it’s possible to route messages to redundant queues. If the primary queue fails, Actors can “failover” to a redundant queue and continue processing without message loss.
While the Actor model is 42 years old, the queue data structure was originally described by Alan Turing 70 years ago, in a paper published in 1947!
While these two “ancient” computing paradigms form the basis for modern, infinitely scaling systems, there are a number of details that must be dealt with, including how to handle work lost when Actors fail; how to maintain state or context; how to handle long-running processes; how to handle “split brain” network failures in light of redundant message queues; synchronization of redundant message queues; etc. This presentation will discuss these issues as well. The goal of the presentation is to outline, for software developers, the framework they can use to develop highly scalable, highly resilient processing systems.
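To make the pull model concrete, here is a minimal sketch, ours rather than the speaker's, of Actors implemented as worker threads pulling from a shared queue; note that adding capacity is just starting another worker, with no load balancer in sight:

import queue
import threading

work = queue.Queue()

def actor(name):
    # Each Actor pulls its next message only when it is free; nothing
    # pushes work onto a busy or fallen-over process.
    while True:
        msg = work.get()
        if msg is None:          # sentinel: shut this Actor down
            work.task_done()
            return
        print("%s processed %r" % (name, msg))
        work.task_done()

# Start two Actors; if the queue backs up, simply start more.
for i in range(2):
    threading.Thread(target=actor, args=("actor-%d" % i,), daemon=True).start()

for job in ["msg-1", "msg-2", "msg-3", "msg-4"]:
    work.put(job)
work.put(None)
work.put(None)                   # one sentinel per Actor
work.join()                      # block until every message is processed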
James Whitehead II, Chief Scientist, Formularity
Brad Whitehead is Chief Scientist for Formularity, an electronic forms company dedicated to the secure collection and processing of personal information. Formerly, he was a Partner and Master Technology Architect with Accenture. Brad has architected and implemented several national-scale information processing systems based on the Actor model and queues. One such system processes billions of biometric transactions per day for the Republic of India, while another handles millions of biometric identification transactions each day while safeguarding the borders of the United States. He has served as a security advisor to several US Federal agencies, including the Department of Homeland Security, the Department of Defense, and the United States Postal Service. Brad holds a BS from Carnegie Mellon University and an MS from the University of Liverpool. He can be reached at brad.whitehead@formularity.com.
Mini Tutorials I
osquery—Windows, macOS, Linux Monitoring and Intrusion Detection
Teddy Reed and Mitchell Grenier, Facebook
This workshop is an introduction to osquery, an open source SQL-powered operating system agent for host visibility and analytics. Osquery was created by the Facebook Security team and is actively developed by Facebook and the open source community. It is currently used by many companies for collecting host forensics and proactively hunting for abnormalities. Osquery makes it easy to ask targeted or broad questions about your heterogeneous infrastructure. This workshop is a very hands-on training, and we expect participants to be comfortable with the command line. The workshop is broken into three components:
Part I - osquery zero -> hero: The first section of the workshop will make use of the interactive osquery command line tool (osqueryi) to explore your operating system. The goal of this section is to get participants familiar with writing SQL statements and to understand how osquery makes use of core tables to abstract operating system concepts.
Part II - osquery at scale: The second part of the workshop will focus on automation and deployment of osquery at a larger scale. You will learn how to configure the osquery daemon (osqueryd) and to write “query packs”. The daemon is a persistent agent that logs events and state changes according to a schedule of queries. Packs are used to share sets of common queries.
Part III - File Integrity Monitoring (FIM), Linux process auditing, and Windows event log collection: The last part of the workshop focuses on three of osquery's eventing features: FIM, process auditing, and Windows event logs. You will add several paths to your configuration and begin collecting hashes when files are updated or created. You will start collecting all execve, bind, and connect syscall arguments on Linux and track inbound SSH connections and the tree of processes launched. You will learn about Windows event logs and how to audit all PowerShell executions.
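To preview the kind of question Part I has you ask, here is a small sketch, assuming osquery is installed locally and on your PATH, that runs a one-off query through osqueryi's JSON output mode:

import json
import subprocess

def osquery(sql):
    # osqueryi accepts a query as an argument; --json asks for JSON rows.
    out = subprocess.run(["osqueryi", "--json", sql],
                         capture_output=True, text=True, check=True)
    return json.loads(out.stdout)

# 'processes' is one of osquery's core cross-platform tables.
for row in osquery("SELECT pid, name FROM processes LIMIT 5;"):
    print(row["pid"], row["name"])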
Teddy Reed, Facebook
Teddy is a Security Engineer at Facebook developing tools to help protect the company. He is very passionate about trustworthy, safe, and secure code development. Ask me about: osquery, firmware, secure roots of trust, and operating system hardening.
Mini Tutorials II
The Ins-and-Outs of Networking in the Big Three Clouds
Chris McEniry, Sony Interactive Entertainment
The big three public cloud providers—AWS, Azure, and GCP—each provide a form of a private network inside of their public clouds. While the fundamental usage is the same, the implementations differ and have different constraints. This tutorial will compare and contrast the different cloud provider networking models, and provide approaches to interconnecting the clouds within each, across each, and with outside resources such as traditional data centers.
Chris McEniry, Sony Interactive Entertainment
Chris "Mac" McEniry is a practicing sysadmin responsible for running a large ecommerce and gaming service. He's been working and developing in an operational capacity for 15 years. In his free time, he builds tools and thinks about efficiency.
3:30 pm–4:00 pm
Break with Refreshments
Grand Ballroom Foyer
4:00 pm–5:30 pm
Plenary Panel
Scaling Talent: Attracting and Retaining a Diverse Workforce
Moderator: Tameika Reed, Founder of WomenInLinux
Panelists: Derek Arnold; Amy Nguyen, Stripe; Qiana Patterson, QP Advisors; Wayne Sutton, Co-Founder, CTO, Change Catalyst
Derek Arnold has worked in many different parts of technology as a system administrator, developer, and instructor in the telecommunications, manufacturing, education, and government sectors for the last 20 years.
Amy Nguyen is a software engineer passionate about making data understandable for everyone. In the past, she studied computer science and philosophy at Stanford University, served on the board of Stanford Women in Computer Science for three years, and helped make computer science the most popular major for female undergraduates during her time there. Outside of work, Amy writes about the tech industry, loves baking, and reads too many self-improvement books.
Qiana Patterson is a seasoned tech executive specializing in K12 education, higher education, and workforce development. She brings over 10 years of experience in the education sector and a wealth of leadership and project/product management expertise in the technology industry. She was the founding COO of Edlio, an LA-based K12 edtech company; prior to that, she served as Interim CEO of Educational Networks, a leading content management software platform company, where she served as a lead manager in almost all areas and teams of the company. Before Educational Networks, Qiana worked as a teacher and Dean of Students in the Los Angeles Unified School District. She currently leads her own tech consulting firm, QP Advisors, where she helps companies from startups to mature businesses develop products customers love.
Wayne Sutton is a serial entrepreneur and co-founder of Change Catalyst and its Tech Inclusion programs. Change Catalyst is dedicated to exploring innovative solutions to diversity and inclusion in tech through the Tech Inclusion Conference, training, workshops, and the Change Catalyst Startup Fellows Program. Sutton’s experience includes years of establishing partnerships with companies ranging from large brands to early-stage startups. As a leading voice in diversity and inclusion in tech, Sutton shares his thoughts on solutions and culture in various media outlets and has been featured in TechCrunch, USA Today, and the Wall Street Journal. In addition to mentoring and advising early-stage startups, Sutton’s life goal is to educate entrepreneurs who are passionate about using technology to change the world. Wayne is a 2017 New America CA Fellow.
6:00 pm–8:00 pm
Conference Reception
Atrium
8:00 pm–11:00 pm
Birds-of-a-Feather Sessions
View the full schedule of BoFs on the LISA17 BoFs page.
Friday, November 3, 2017
8:00 am–12:00 pm
On-Site Registration and Badge Pickup
Market Street Foyer
8:00 am–9:00 am
Continental Breakfast
Grand Ballroom Foyer
9:00 am–10:30 am
Talks I
An Internet of Governments: How Policymakers Became Interested in “Cyber”
Maarten Van Horenbeeck, Fastly, Inc.
Gradually, the internet has become a bigger part of how we socialize, do business, and lead our daily lives. Though they typically do not own much of the infrastructure, governments have taken ever-increasing note, often with aspiration and sometimes with suspicion. In this talk, we’ll cover how governments internationally debate and work on topics of cybersecurity, agree on what the challenges are, and get inspiration on solutions. The talk will show how these concerns often originate domestically, but then enter several processes in which governments meet, debate, agree, and disagree on their solutions. You’ll learn about initiatives such as the ITU, the UNGGE, the Global Conference on Cyberspace, and the Internet Governance Forum, and how you as an engineer can contribute!
Maarten Van Horenbeeck, Fastly, Inc.
Maarten Van Horenbeeck is Vice President of security engineering at Fastly, a content delivery network that speeds up web properties around the world. He is also a board member and former chairman of the Forum of Incident Response and Security Teams (FIRST), the largest association of security teams, counting 300 members in over 70 countries. Previously, Maarten managed the Threat Intelligence team at Amazon and worked on the Security teams at Google and Microsoft. Maarten holds a master’s degree in information security from Edith Cowan University and a master’s degree in international relations from the Freie Universität Berlin. When not working, he enjoys backpacking, sailing, and collecting first-edition travel literature.
Clarifying Zero Trust: The Model, the Philosophy, the Ethos
Evan Gilman; Doug Barth, Stripe
The world is changing, though our network security models have had difficulty keeping up. In a time where remote work is regular and cloud mobility is paramount, the perimeter security model is showing its age—badly.
We deal with VPN tunnel overhead and management. We spend millions on fault-tolerant perimeter firewalls. We carefully manage all entry and exit points on the network, yet still we see ever-worsening breaches year over year. The Zero Trust model aims to solve these problems.
Zero Trust networks are built with security at the forefront. No packet is trusted without cryptographic signatures. Policy is constructed using software and user identity rather than IP addresses. Physical location and network topology no longer matter. The Zero Trust model is unique indeed.
In this talk, we'll discuss the philosophy and origin of the Zero Trust model, why it's needed, and what it brings to the table.
Evan Gilman
Evan Gilman is a Network Engineer turned SRE. With experience in protocol analysis, distributed systems design and network security, Evan has been building systems in untrusted networks his entire life. An open source contributor, author, and speaker, Evan's passion lies in designing systems which strike a balance with the network they run on.
Doug Barth, Stripe
Doug is a software generalist with extensive operational experience. Currently an SRE at Stripe, Doug has run the gamut of technical responsibility. He previously worked with Evan Gilman as an SRE at PagerDuty, and he and Evan are co-authors of the upcoming O'Reilly book "Zero Trust Networks".
Talks II
Coherent Communications—What We Can Learn from Theoretical Physics
Kevin Barron, University of California, Santa Barbara
In the tech world we typically focus almost exclusively on instrumental communication, because once we have nailed down our communications objective in unambiguous, jargon-free terms, we feel we can communicate precisely with our clientele and team members. And yet we fail—often spectacularly. Then we blame all the wrong things: the clients did not take enough interest, the team members were distracted or went off-message. On the other hand, we sometimes experience what seem to be spontaneous moments of clarity and free-flowing ideas, but rarely consider what enabled them. To better understand this dynamic, we need to step back and take the end-to-end view. In other words, use the same troubleshooting methods we would apply to a technical problem. Once we take a broader systemic view, we can remove the problems and actively promote coherent communication.
Kevin Barron, University of California, Santa Barbara
Kevin Barron is IT Director at the Institute for Theoretical Physics, located on the University of California, Santa Barbara campus. He's been working in advanced network activities since 1983, including instigating the first dark-fiber, customer-owned network in the country: CENIC, which connects all EDUs (universities and K-12) and many government agencies in California. He was the founder and chair of the Santa Barbara Broadband Coalition, and is generally the chief trouble-maker when it comes to internet issues. He is a contributing author of two best-selling books on the Internet, and has given numerous seminars, classes, and presentations on the subject of networking.
Pintrace: A Distributed Tracing Pipeline
Suman Karumuri, Pinterest
Speed improves customer engagement. With the emergence of microservices, it is very common for a single customer interaction, such as loading the home page or querying a search endpoint, to invoke hundreds of calls to dozens of back-end services. In this multi-tenant environment, traditional monitoring and profiling tools can't tell us why a specific request was slow.
Distributed tracing is the only tool available today that lets us trace a request across several systems. Using the gathered traces, we can correctly debug how a specific request is processed across the service, understand where an application spent most of its time and gain insight into why a particular request was slow.
In this talk, I will present PinTrace, our Zipkin-based distributed tracing infrastructure. I will also talk about the challenges of instrumenting and deploying tracing in a polyglot microservices architecture at scale. I will share a few examples of how we use traces from production to debug p99 latency issues and to identify unnecessary network calls and performance bottlenecks in the system. I will conclude the talk with a few use cases of distributed tracing beyond performance optimization, such as architectural visualization.
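To illustrate what such instrumentation boils down to (a hand-rolled sketch for exposition, not PinTrace's or Zipkin's actual API), each request carries one trace ID, and every unit of work records a timed span against it:

import time
import uuid
from contextlib import contextmanager

spans = []   # a real tracer ships these to a collector instead

@contextmanager
def span(trace_id, name, parent=None):
    span_id = uuid.uuid4().hex[:16]
    start = time.time()
    try:
        yield span_id
    finally:
        spans.append({"trace_id": trace_id, "id": span_id, "parent": parent,
                      "name": name, "ms": (time.time() - start) * 1000})

trace = uuid.uuid4().hex[:16]        # one ID for the whole request
with span(trace, "home_page") as root:
    with span(trace, "search_backend", parent=root):
        time.sleep(0.05)             # stand-in for a slow back-end call

for s in spans:                      # inner spans complete (and report) first
    print(s)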
Suman Karumuri, Pinterest
Suman Karumuri is the lead for distributed tracing at Pinterest. Previously, he served as the lead for the Zipkin project at Twitter. He is the author of Distributed Tracing, an upcoming book from O’Reilly.
Mini Tutorials I
Chef: Scripts to Recipes
Morgan Drake, Chef Software
When learning system automation, one of the most difficult learning curves new users face is the step between picking up a basic understanding of the automation language and understanding how to apply that language to their real-world infrastructure problems. This short session takes you through a true-to-life example of converting a Tomcat runbook with step-by-step installation instructions into a Chef recipe that automates the installation process.
All Chef tooling used is available as part of Chef’s open-source product suite.
Chef will be providing remote instances for you to work along with. You will need a laptop and an SSH client. This course assumes novice familiarity with the Linux command line and an editor such as Vim, Emacs, or Nano. Due to the time limitations of the course, we will be unable to provide tutoring for students not familiar with these tools.
While this course is specific to configuring Tomcat, the topics covered are applicable to many common automation tasks. We’ll also teach you how to use Chef’s documentation to keep learning as you take these skills back to your own infrastructure.
Morgan Drake, Chef Software
Morgan Drake is a Chef Solutions Architect, Linux engineer, and CS master’s student. She’s all in on enabling individuals and organizations to build faster, more reliable, and more humane IT infrastructures, and you’ll see her running Chef’s training courses, meetups, and weekly open source Office Hours. She has extensive professional experience in the academic and nonprofit sectors, and continues to volunteer in academia with a focus on enabling the technical skill sets of nontraditional students and dislocated workers. In her free time you’ll often find her watching professional wrestling while perfecting the art of hand embroidery.
Mini Tutorials II
Kubernetes: Hit the Ground Running
Chris McEniry, Sony Interactive Entertainment
Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications. This introduction to Kubernetes will show you what it's like to be a user of Kubernetes, and identify concerns for starting a Kubernetes environment. Using minikube, you'll bring up a sample application and see how Kubernetes can be used to help with the ongoing operations of it. You'll examine what makes a good match for Kubernetes, and what concerns need to be addressed for other applications. You'll run through a set of concerns and constraints for what it would take to provide a Kubernetes environment.
Chris McEniry, Sony Interactive Entertainment
Chris "Mac" McEniry is a practicing sysadmin responsible for running a large ecommerce and gaming service. He's been working and developing in an operational capacity for 15 years. In his free time, he builds tools and thinks about efficiency.
10:30 am–11:00 am
Break with Refreshments
Grand Ballroom Foyer
11:00 am–12:30 pm
Talks I
DevOps in Regulatory Spaces: It's Only 25% What You Thought It Was…
Peter Lega, Merck and Company
You’ve embraced the DevOps concepts, found your sympaticos, established a solid technical ecosystem and culture, and even delivered some great early results with a first follower portfolio. Now, you have entered the mission-critical regulatory problem space at scale. The traditional DevOps “goodness” and culture have taken you this far. Now it’s time to scale, with a whole new set of regulatory and compliance constituents and technical maturity needs.
In this talk, we will share our “first contact” experience with the long-established regulatory community as we embarked on delivering larger, more complex solutions, along with the challenges and compelling opportunities to transform that have unfolded, enabling compliance as code from portfolio through production.
Peter Lega, Merck and Company
Pete Lega is Director of Emerging Technology at Merck & Company, where he has been shepherding the DevOps program over the last 3 years.
Before this position, he led Merck’s enterprise web and mobility services, as well as providing SWAT-team expertise to ongoing tactical technical initiatives.
Prior to joining Merck, he was VP of Technical Architecture at CNET Networks, where he led the development of several franchise sites (shareware.com, download.com, and buydirect.com). He also held senior roles including Technical Director at Digital Equipment and Divisional CIO at Bear Stearns, and was a rapporteur on Digital Content for the European Commission.
Peter holds a B.Sc. in Computer Science from Moravian College.
Failure Happens: Improving Incident Response in Large-Scale Organizations
Damon Edwards, Rundeck, Inc.
Deployment is a solved problem. Yes, there is still work to be done, but the operations community has successfully proven that we can both scale deployment automation and distribute the capability to execute deployments. Now, we have to turn our attention to the next critical constraint: What happens after deployment?
We all know that failure is inevitable and is coming our way at any moment. How do we respond quickly and effectively to those failures? What works when there is just a small set of teams or an isolated system to manage will quickly break down when the organization grows in size and complexity. On the other hand, what has been commonly practiced in large-scale enterprises is proving to be too cumbersome, too silo-dependent, and simply too slow for today's business needs.
How do we rapidly respond to incidents and recover complex interdependent systems while working within an equally complex and interdependent organization? How does operations embrace the DevOps and Agile inspired demand for speed and self-service while maintaining quality and control?
This talk examines the trial-and-error lessons learned by some forward-thinking enterprises who are currently streamlining how they:
- Resolve incidents
- Reduce friction between teams
- Divide up operational responsibilities
- Improve the quality of their ongoing operations.
See how these companies are rethinking how and where operations happens by applying Lean and DevOps principles mixed with modern tooling practices.
This talk will:
- Dissect examples of operational incidents from inside actual large enterprises
- Identify the common organizational and technical anti-patterns that prevent quick and effective incident resolution and interfere with organizational learning
- Discuss emerging design patterns and techniques that remove the friction and bottlenecks while empowering teams (highlighting publicly referenceable work shared with the DevOps community)
Damon Edwards, Rundeck, Inc.
Damon Edwards is a Co-Founder of Rundeck, Inc., the makers of Rundeck, the open orchestration and scheduling platform. Damon Edwards was previously a Managing Partner at DTO Solutions, a DevOps and IT Operations improvement consultancy. Damon has spent over 15 years working with both the technology and business ends of IT operations and is noted for being a leader in porting cutting-edge DevOps techniques to large enterprise organizations. Damon is also a frequent conference speaker and writer who focuses on DevOps and operations improvement topics. Damon is active in the international DevOps community, including being a co-host of the DevOps Cafe podcast with John Willis, an early core organizer of the DevOps Days conference series, and a content chair for Gene Kim’s DevOps Enterprise Summit.
Talks II
Capacity and Stability Patterns
Brian Pitts, Eventbrite
At Eventbrite, engineers are tasked with building systems that can withstand dramatic spikes in load when popular events go on sale. There are patterns that help us do this:
- Bulkheads: partitioning systems to prevent cascading failures
- Canary testing: slowly rolling out new code
- Graceful degradation: turning functionality on and off in response to failures or load
- Rate limiting: controlling the amount of work you accept
- Timeouts: limiting the time you wait for a request you made to complete
- Load shedding: purposefully not handling some requests in order to reserve resources for others
- Caching: saving and re-serving results to reduce expensive requests
- Planning: getting the resources you need in place before you need them
In this talk you'll learn about each of these patterns, how Eventbrite has adopted them, and how to implement them within your own code and infrastructure.
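As a flavor of one of these patterns, here is a minimal sketch of rate limiting via a token bucket, which admits a steady request rate while absorbing short bursts; the rate and burst numbers are arbitrary illustrative choices:

import time

class TokenBucket:
    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False          # caller rejects, queues, or sheds the request

bucket = TokenBucket(rate_per_sec=5, burst=10)
accepted = sum(bucket.allow() for _ in range(100))
print("accepted %d of 100 back-to-back requests" % accepted)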
Brian Pitts, Eventbrite
Brian studied political science in college, but when his thesis contained more Python code than prose, it was clear where his true loyalties lay. He’s worked in operations roles for the past eight years and currently carries the pager for Eventbrite. He lives in Nashville with his wife, son, two cats, and a collection of 1990s Unix workstations.
"Don't You Know Who I Am?!" The Danger of Celebrity in Tech
Corey Quinn, Last Week in AWS
Thought Leaders. DevOps Heroes. Public Speakers. We listen to them as they talk about their solutions, their approaches, and their inevitable triumphs. But are we starting down a dark path as we forget that 'what makes a great talk' and 'what makes sense for your environment' may not be the same thing? In this entertaining and slightly irreverent talk, the speaker discusses the dangers of taking others' experiences as a source of absolute truth. A discussion will ensue of how the innovative and clever solutions that headline various conference talks may very well not apply to your environment, including but not limited to:
- Matching business requirements to what technologies can deliver
- The trap of feeling like you're falling behind if you're not doing what the "bleeding edge" companies are
- At least one story of how following this approach went hilariously wrong
Corey Quinn, Last Week in AWS
Principal at The Quinn Advisory Group, Corey has a history as an engineering manager, public speaker, and advocate for cloud strategies which speak to company culture. He specializes in helping companies control and cost optimize their AWS cloud footprint without disrupting the engineers using it.
Outside of his professional work, Corey is known for overdressing, telling entertaining stories, and carrying a cigarette case full of drink umbrellas.
Mini Tutorials I
Containers at Scale with Kubernetes, Docker, and Azure
Jennelle Crothers, Microsoft
Container technology allows you to achieve greater density on your hosts, reduce conflicts between dev/test/prod environments, and increase deployment speed. In this session you'll learn how to easily go from using Docker containers on your workstation with Docker for Mac/Windows to bringing those containers to your datacenter or cloud provider (either on IaaS or a container service) and deploying them at scale using Docker swarms or Kubernetes. Bring your Azure subscription (or trial) and deploy containers to your own Kubernetes cluster in almost no time!
Jennelle Crothers, Microsoft
Jennelle Crothers is a Microsoft Technical Evangelist who likes computer networking, server administration, dogs, quilting, popcorn, and, on most days, public transportation. Before joining Microsoft, Jennelle Crothers spent 15 years as a Systems Administrator "jack of all trades," overseeing Windows domains, Exchange Server, desktops, and other IT systems, where she struck fear into the hearts of end users with complex password policies and email retention tags. Now she supports platform awareness for Azure, Windows Server and related technologies, containers, and DevOps. Jennelle is a Microsoft Azure Specialist and, prior to joining Microsoft, was a four-time Microsoft MVP.
When not thinking about technology, Jennelle volunteers with Guide Dogs for the Blind and sneaks away to read dystopian novels.
Mini Tutorials II
HPC for Everybody
Cory Lueninghoener, Wireless Couch Labs
High performance computing (HPC) spans a variety of topics from standard system administration to large-scale system design and development, and the topic is relevant to more than just the scientific computing community. The successful HPC admin or admin team needs to have some understanding of topics such as system architecture, scalability, parallel filesystems, networking, job scheduling, and software development tools to provide good support to their customers. This mini-tutorial will be a survey of tools, techniques, and concepts that can be used to get a new HPC capability started or give a boost to existing HPC installations, whether the customers are running scientific research, product design, safety simulations, or any other type of HPC job. The tutorial will include an organized slide deck with more detail than can be covered in the 90 minutes, which can be used afterwards as reference material describing currently existing tools, what they are useful for, and where to find more information about them.
Cory Lueninghoener, Wireless Couch Labs
Cory Lueninghoener is a large-scale systems guy who has helped design, build, and manage some of the largest scientific computing resources in the world. During his time working with HPC systems at Argonne National Laboratory and Los Alamos National Laboratory, he worked with HPC platforms ranging in size from 100,000 to 900,000 processors. He is especially interested in turning large-scale system research into practice, and has also worked on configuration management and system management tools in the past. Cory was co-chair of LISA 2015 and is active in the large scale system engineering community.
12:30 pm–2:00 pm
Conference Luncheon
Atrium
2:00 pm–3:30 pm
Talks I
Sample Your Traffic but Keep the Good Stuff!
Ben Hartshorne, Honeycomb
The two main methods of reducing high volume instrumentation data to a manageable load are aggregation and sampling. Aggregation is well understood, but sampling remains a mystery.
We'll start by laying down the basic ground rules for sampling—what it means and how to implement the simplest methods. There are many ways to think about sampling, but with a good starting point, you gain immense flexibility. Once we have the basics of what it means to sample, we'll look at some different traffic patterns and the effect of sampling on each. When do you lose visibility into your service with simple sampling methods? What can you do about it?
Given the patterns of traffic in a modern web infrastructure, there are some solid methods to change how you think about sampling in a way that lets you keep visibility into the most important parts of your infrastructure while maintaining the benefits of transmitting only a portion of your volume to your instrumentation service.
Taking it a step further, you can push these sampling methods beyond their expected boundaries by using feedback from your service and its volume to affect your sampling rates! Your application knows best how the traffic flowing through it varies; allowing it to decide how to sample the instrumentation can give you the ability to reduce total throughput by an order of magnitude while still maintaining the necessary visibility into the parts of the system that matter most.
I'll finish by bringing up some examples of dynamic sampling in our own infrastructure and talk about how it lets us see individual events of interest while keeping only 1/1000th of the overall traffic.
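As a rough sketch of the dynamic idea (ours, not Honeycomb's implementation), the sampler below keeps rare traffic keys at full fidelity, samples common keys ever more aggressively as their volume grows, and tags every kept event with its sample rate so true volumes can be reconstructed later:

import random
from collections import Counter

counts = Counter()    # rough per-key volume observed so far

def sample(event, key):
    counts[key] += 1
    # The sample rate grows with a key's volume, so rare keys (errors)
    # are kept at or near full fidelity while the common case is thinned.
    rate = max(1, counts[key] // 100)
    if random.randrange(rate) == 0:
        event["sample_rate"] = rate   # recorded so volumes can be re-weighted
        return event                  # keep: ship to the instrumentation service
    return None                       # drop

kept = 0
for i in range(100000):
    status = "500" if i % 1000 == 0 else "200"   # errors are rare
    if sample({"status": status}, status) is not None:
        kept += 1
print("kept %d of 100000 events" % kept)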
Ben Hartshorne
Ben Hartshorne has been looking for the needles in a haystack of servers for a decade. He has finally figured out that (with the right kind of needles) magnets can make a huge difference. Observability tools may look different but have mostly kept the same leftover ideas in the last 15 years—it's time to shake it up. Ben joined Honeycomb.io to bring some of the tools the big companies have to the rest of us and help every engineer build better products, sleep better, and resolve problems faster.
Where's the Kaboom? There Was Supposed to Be an Earth-Shattering Kaboom!
David Blank-Edelman
Let's face it. We are great at building things—systems, services, infrastructures—you name it. But we are terrible, absolutely terrible, at decommissioning, demolishing, or destroying these same things in any sort of principled way. We spend so much time focused on how to construct systems that when it comes time to do the dance of destruction we are at a loss. We are even worse at building systems that will later be easy to destroy.
But it doesn't have to be this way. When they take down a bridge, a building, or even your bathroom before a renovation, things just don't get ripped out willy-nilly (hopefully). There are methods, best practices, and lots and lots of careful work being brought to bear in these situations. There are people who demolish stuff for a living; let's see what we can learn from them to take back to our own practice. Come to this talk not just for the explosions (and oh, yes, there will be explosions), but also to explore an important part of your work that never gets talked about: the kaboom.
David Blank-Edelman
David is one of the co-founders of the now global set of SREcon conferences. He has over thirty years of experience in the systems administration/DevOps/SRE field in large multiplatform environments and is the author of the O'Reilly Otter book. David is honored to serve on the USENIX Board of Directors where he helps to organize and engineer conferences like LISA and SREcon. He prefers to pronounce Evangelist with a hard 'g'.
Debugging at Scale Using Elastic and Machine Learning
Mohit Suley, Microsoft
Engineers are well-versed in debugging issues on a single machine. However, when the architecture scales out to possibly hundreds or thousands of machines with components 10+ layers deep, debugging doesn't look the same anymore. Looking at logs becomes 'collective' in nature, and looking for patterns in logs is the only viable way of associating them with the problems you are trying to solve.
We will walk through the motivation for building such a system and how it differs from traditional monitoring and debugging. A system designed this way collects all needed artifacts, identifies known and unknown patterns in error messages, correlates them with the infrastructure serving these errors, and allows outlier service components to be exposed within 10-15 minutes of a developing problem trend.
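One building block of such pattern detection, sketched here as an illustration rather than the system described in the talk, is normalizing log messages into templates by masking variable fields so that 'the same' error can be counted across thousands of machines:

import re
from collections import Counter

def template(line):
    # Mask the variable fields so identical failures collapse together.
    line = re.sub(r"\b\d+\.\d+\.\d+\.\d+\b", "<ip>", line)   # IP addresses
    line = re.sub(r"\b[0-9a-f]{8,}\b", "<hex>", line)        # ids and hashes
    line = re.sub(r"\b\d+\b", "<n>", line)                   # other numbers
    return line

logs = [
    "timeout after 5000 ms calling 10.0.3.7",
    "timeout after 3200 ms calling 10.0.9.2",
    "disk full on volume 2",
]
patterns = Counter(template(l) for l in logs)
for pattern, count in patterns.most_common():   # most frequent pattern first
    print(count, pattern)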
Mohit Suley, Microsoft
Mohit is an Availability Engineer on Bing's Live Site Engineering team. By day, he investigates all the issues that subtly affect Bing’s availability and performance. Designing systems to proactively improve availability and route around problems is a core mission of the team. He loves long walks, talking about end-user availability, and how network-level data can tell interesting stories about customer experience in aggregate. R is his go-to data analysis tool these days. Opportunities to dive into network flows, architecture issues, or scaling problems never go ignored.
Talks II
Managing SSH Access without Managing SSH Keys
Niall Sheridan, Intercom
Everyone uses SSH to manage their production infrastructure, but it's really difficult to do a good job of managing SSH keys. Many organisations don't know how many SSH keys have access to production systems or how protected those keys are. A trusted SSH private key can be years old, unprotected by passphrase, and shared among multiple people who may not even work for you.
With some tooling and configuration, SSH keys can be replaced with limited-use ephemeral certificates, issued centrally and with better access controls and automatic key expiration, solving many of the shortcomings of using SSH keys.
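Mechanically, issuing such a certificate comes down to a CA signing a user's public key with a short validity window. Here is a sketch of what a central issuer might run per request, assuming OpenSSH's ssh-keygen and illustrative file names and principals:

import subprocess

def issue_cert(ca_key, user_pubkey, identity, principals, validity="+1h"):
    # Writes a certificate next to user_pubkey with a -cert.pub suffix.
    subprocess.run([
        "ssh-keygen",
        "-s", ca_key,                # CA private key that signs the cert
        "-I", identity,              # identity recorded in server logs
        "-n", ",".join(principals),  # accounts the holder may log in as
        "-V", validity,              # e.g. +1h: the cert expires on its own
        user_pubkey,
    ], check=True)

# Illustrative names; a real issuer authenticates the requester first.
issue_cert("ca_key", "id_ed25519.pub", "niall@example.com", ["deploy"])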
This talk will cover:
- Managing SSH keys: The bad parts
- Replacing SSH keys with ephemeral certificates: how & why
- Discussion of an implementation of a CA for SSH certificates
- Call for participation, showing the GitHub source
Niall Sheridan, Intercom
Niall Sheridan is an SRE on Intercom's infrastructure team. His main interests are automation and monitoring, and he loves a good post-mortem.
Calcifying Crisis Readiness
Rock Stevens, University of Maryland
No organization is immune to data breaches, insider threats, and other cyber attacks. An unprepared organization can exacerbate the impact of these threats, leading to a loss of consumer trust and confidence. In this talk, I propose a radically new training method for preparedness that fuses together concepts like the Netflix “Chaos Monkey” with U.S. military “react to contact” drills. Your organization, from technicians to C-level executives, can immediately adopt this proposal to mitigate future threats and lessen the effect of successful attacks.
Rock Stevens, University of Maryland
Rock Stevens is a lifelong student of information technology, earning his first certification as a network administrator at the age of 15. He is a U.S. Army Cyber Officer and served as a Madison Policy Forum Military-Business Cybersecurity Fellow in 2015. He is actively pursuing a Ph.D. in Computer Science from the University of Maryland and an M.A. in National Security and Strategic Studies from the United States Naval War College.
Wait for Us! Evolving On-Call as Your Company Grows
Christopher Hoey, Datadog
The talk will start with a quick overview of the rapid growth Datadog experienced and the resulting challenges, to illustrate how a simple primary-and-secondary on-call team eventually starts to fall apart.
In hindsight the signs are obvious; however, in the thick of it all, it is hard to step back and realize that the on-call team and processes are falling apart. It should be said that what was in place worked and met its needs for a long time. You have to start somewhere. The evolution is what I focus on, while sharing the tricks to make that evolution easier.
The talk will then go into some of the patterns Datadog found useful, such as refining our incident management processes and roles, growing the depth of the on-call team, and eventually switching to per-team rotations, along with the challenges involved throughout this evolution.
We will highlight some of the useful tricks and tools Datadog has used, such as:
- Structured service templates to help with on-call training
- On-call training and shadow ops rotations
- The use of GitHub Issues to track on-call tasks for handoff and to use as training examples
- Scheduled on-call handoffs that include systematically reviewing the sources of alerts to kill noise
- Providing a way to capture monitor feedback from every alert notification
- Patterns of using GitHub Projects to track where each on-call member stands in their service training
- Scripts used in conjunction with the service templates and on-call scheduling to show each on-call member a list of what changed since the last time they were on-call
Christopher Hoey, Datadog
Christopher Hoey currently leads the SRE team at Datadog. Prior to that, he was Director of Engineering, Operations at Mortar Data and Senior IT Manager at Amplify. Chris is a seasoned veteran, having ridden the growth roller coaster numerous times while leading the operations teams that keep things running smoothly.
Outside of work Chris enjoys spending time with his family, riding downhill mountain bikes and tinkering on projects like open source telemetry systems.
Mini Tutorials I
Security Automation for Containers and VMs with OpenSCAP
Martin Preisler and Marek Haicman, Red Hat, Inc.
SCAP is a set of specifications related to security automation. SCAP is used to improve security posture (hardening and finding vulnerabilities) as well as for regulatory reasons. It is heavily used in the government, defense, and finance industries. OpenSCAP is an open source implementation of the SCAP standard. The project and its various integrations allow automated scanning of large infrastructures.
The core focus of this mini-tutorial is how to do an SCAP evaluation of containers and virtual machines that are part of infrastructures deployed in production. There are two major use cases of SCAP, both covered by our tutorial.
In the first part, we will look at scanning machines for known vulnerabilities. We will show how CVE and CVE OVAL content relate to each other. As a demo, we will show vulnerability scanning of Red Hat Enterprise Linux 7 and openSUSE from the command line.
In the second part, we will focus on ensuring a system is configured according to a predefined policy (i.e., compliance). This part of the tutorial will start with a scan of a single machine for compliance with one of the profiles in the SCAP Security Guide. For demonstration purposes we will use PCI-DSS, but the same workflow works for any profile. Customizing SCAP content to better fit your needs will follow: selecting extra rules, unchecking unsuitable rules, and altering values. Using the customized SCAP content, we will perform scans of a bare machine, a virtual machine, and a container. Then we will discuss ways to scan multiple targets continuously using Satellite 6.
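For reference, the single-machine compliance scan described above typically boils down to one oscap invocation; the sketch below wraps it from Python, where the datastream path and profile ID are the common RHEL 7 defaults but vary by system:

import subprocess

result = subprocess.run([
    "oscap", "xccdf", "eval",
    "--profile", "xccdf_org.ssgproject.content_profile_pci-dss",
    "--results", "results.xml",    # machine-readable results
    "--report", "report.html",     # human-readable report
    "/usr/share/xml/scap/ssg/content/ssg-rhel7-ds.xml",
])
# oscap exits 0 when all rules pass and 2 when at least one rule failed.
print("compliant" if result.returncode == 0 else "findings; see report.html")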
If time permits, we will discuss how to write new custom content using SCE, the Script Check Engine.
Martin Preisler, Red Hat, Inc.
Martin Preisler works as a Software Engineer at Red Hat, Inc. He is working in the Security Technologies team, focusing on security compliance using Security Content Automation Protocol. He is the principal author of SCAP Workbench, a frequent contributor to OpenSCAP and SCAP Security Guide, and a contributor to the SCAP standard specifications. Outside of work he likes playing guitar, skiing, billiards and indoor climbing.
Marek Haicman, Red Hat, Inc.
Marek Haicman works as a Quality Engineer at Red Hat, Inc. He is the lead Quality Engineer for the SCAP domain in RHEL QE, working both downstream and upstream of the SCAP project. Apart from catching computer bugs, he enjoys boxing and dragon boat racing.
Mini Tutorials II
Getting Started with Bash on Windows 10 and How to Apply It to DevOps
Jessica Deen, Microsoft
Windows 10 now provides developers and IT pros/ops with a familiar Bash environment. This environment allows users to:
- Run native Linux binaries, including grep, sed, and awk
- Navigate a new Linux-based file system using these commands
- Run bash shell scripts which rely on supported command-line utilities
Windows accomplishes this through the Windows Subsystem for Linux, which allows Ubuntu user-mode binaries provided by Canonical to run on Windows 10. This means that the command-line utilities are the same as those that run within a native Ubuntu environment. In this session we will showcase scripting, code editing/compilation, and execution of X11 apps compiled for Linux using a local X11 server from within the Bash on Ubuntu on Windows environment. We will then discuss the implications of these features as they relate to existing developer workflows. This will include a demonstration showcasing compilation of various programs using Python, C++, and Ruby. We will also demonstrate how to edit scripts and push them to a GitHub repo from within Visual Studio Code, using Bash on Ubuntu on Windows as an integrated terminal. Finally, we will show how you can apply this new tool to standard DevOps practices.
Jessica Deen, Microsoft
Jessica is a Technical Evangelist for Microsoft focusing on Azure, infrastructure, cloud, and OSS. Prior to joining Microsoft, she spent over a decade as an IT consultant/systems administrator for various corporate and enterprise environments, catering to end users and IT professionals in the San Francisco Bay Area. Jessica holds three Microsoft certifications (MCP, MSTS, Azure Infrastructure), three CompTIA certifications (A+, Network+, and Security+), four Apple certifications, and is a former four-year Microsoft Most Valuable Professional for Windows and Devices for IT. In 2013, she also achieved her FEMA certification from the U.S. Department of Homeland Security, which recognizes her leadership and influence abilities during times of crisis and emergency.
3:30 pm–4:00 pm
Break with Refreshments
Grand Ballroom Foyer
4:00 pm–5:30 pm
Closing Plenary
System Crash, Plane Crash: Lessons from Commercial Aviation and Other Engineering Fields
Jon Kuroda, University of California, Berkeley
Commercial aviation, civil and structural engineering, emergency medicine, and the nuclear power industry all have hard-earned lessons gained over their respective histories, histories that stretch back decades or even centuries. Often acquired at a bloody cost, these experiences led to the development of environments typified by stringent regulation, strict test and design protocols, and demanding training and education requirements—all driven by a need to minimize loss of life.
In stark contrast, the computer industry in general and systems administration specifically have developed in a relatively unrestricted environment, largely free, outside of a few niche fields, from the regulation and external control seen in life-safety critical fields.
However, despite these major differences, these far more demanding environments still have many lessons to offer systems administrators and systems designers and engineers to apply to the design, development, and operation of computing systems.
We will look at incidents ranging from Air France 447 to Three Mile Island and what we can learn from the experiences of those involved both in the incidents and the subsequent investigations. We will draw parallels between our field as a whole and these other less forgiving fields in areas such as Education and Training, Monitoring, Design and Testing, Human Computer/Systems Interaction, Human Performance Factors, Organizational Culture, and Team Building.
We hope that you will take away not just a list of object lessons but also a new perspective and lens through which to view the work you do and the environment in which you do it.
Jon Kuroda, University of California, Berkeley
Jon is a sysadmin and research engineer at the Department of Electrical Engineering at the University of California, Berkeley, where he spends his days (and nights) puzzling over misbehaving Spark clusters, untangling network cable incompatibilities, debugging business process, trying to manage datacenter spaces, and still having a social life all while trying to keep up with dozens of computer science researchers. Three out of five isn't bad, right?