DevConf.CZ 2020 has ended
ML / AI / Big Data
Friday, January 24
 

10:30am CET

What is Machine Learning?
Machine learning applications in industry have exploded in the last decade. While many of the algorithms are old, the availability of cheap and fast compute, cheap storage, and relevant data has made it much easier to train useful machine learning models. But what is machine learning exactly? This talk is aimed at people who have heard the term multiple times but are unsure what it means. The talk will not cover the details of algorithms and their applications but will instead focus on the scientific foundations of machine learning.

Speakers

Sanjay Arora

Data Scientist
Data scientist at Red Hat



Friday January 24, 2020 10:30am - 11:55am CET
A112 Faculty of Information Technology Brno University of Technology, Božetěchova, Brno-Královo Pole, Czechia
 
Saturday, January 25
 

10:00am CET

How to Design Feature Vectors for Model Inputs and
Statistical learning can and should be applied to many tasks and situations. This hot topic is covered in many talks and classes that discuss machine learning models and the impressive results they can achieve. That is step two, though; the first step is being able to create datasets that can be used for training. It is not just about the data used and its quality but also about its representation. An ill-chosen representation can make model convergence slow or even impossible, making the model potentially unusable. The same applies to the representation of the model output: not all output formats are the same.
This talk covers the problems with the representation of features and results. The effects of bad choices are shown, along with examples from a number of different problem areas that show how creative the data scientist (sometimes) has to be to produce a well-performing model.

Speakers

Ulrich Drepper

System Research & Data Science, CTO Office, Red Hat
Data Scientist, CTO Office

Sanjay Arora

Data Scientist
Data scientist at Red Hat



Saturday January 25, 2020 10:00am - 10:55am CET
A112 Faculty of Information Technology Brno University of Technology, Božetěchova, Brno-Královo Pole, Czechia

11:00am CET

Data cleaning: when less is more
In today's ML world we are gathering and analyzing an enormous amount of data. But how do we deal with too much information, i.e. too many variables? We can use grid search to select variables for us, but this process is very computationally expensive. In my talk I will show various strategies for variable selection and how to combine them into data cleaning pipelines. I will cover univariate variable selection, PCA (Principal Component Analysis) and penalized regression techniques. This talk will give you practical tips on how to get the most out of your data instead of getting lost in variables.

Speakers

Anastazie Sedláková

Data scientist, Freelancer
I am a data scientist, programming course lecturer, and mom of two. Together with my husband, I organize programming courses (sedlakovi.org). My background is in statistical genetics. During my PhD, I learned to program and then changed fields completely - to work with financial...



Saturday January 25, 2020 11:00am - 11:55am CET
A112 Faculty of Information Technology Brno University of Technology, Božetěchova, Brno-Královo Pole, Czechia

12:00pm CET

package2vec: getting to know PyPI packages with ML
Recently, Bommarito et al. released the paper “An Empirical Analysis of the Python Package Index (PyPI)” that explores many interesting statistics concerning the Python ecosystem. Can we use machine learning to go beyond pure statistics? This session will discuss how various SOTA Natural Language Processing and Graph Neural Network techniques can be applied to give new insights into packages on PyPI. Specifically, we will detail our approaches to embedding Python packages into learned vector spaces to reveal package similarity and topics within PyPI. In addition, we will discuss the potential applications and benefits of having these learned representations in the context of package recommendations for developers.

Speakers

Devin de Hueck

AI Data Engineering Intern, Red Hat
Interested in all things ML



Saturday January 25, 2020 12:00pm - 12:55pm CET
A112 Faculty of Information Technology Brno University of Technology, Božetěchova, Brno-Královo Pole, Czechia

1:00pm CET

ML with real impact driven by OS technologies
Big data and machine learning have become hot topics in the last few years. But the question is: how can we use them to solve a real issue in our lives? Let us take you on a journey through the world of data science and demystify the hype behind it by showing you how to achieve a real impact with interpretable and easy-to-understand models.

In this presentation, we will introduce the open-source technologies we use to create machine learning solutions with a real impact. We are passionate about technologies like Spark, Delta Lake, MLlib and MLflow and we would like to share with you what we have learned and why we use them. The talk will also feature a live coding demonstration to show how to start a project with the use of these technologies. We will also cover how to use them to solve real-life issues, such as the ones we have run into while working on our projects.

Speakers

Nikola Valesova

Data Scientist, DataSentics a.s.
A full-time data scientist with a passion for (not only) machine learning. As a recent FIT BUT graduate, I have a deep technical understanding of computer science, and throughout my studies I became keen on image processing, AI and data science. During my internships at Red Hat and...

Tomas Kresal

Trust me I am Engineer, Datasentics a.s.
I started my professional career almost ten years ago as a software engineer at Seznam.cz. There I found a passion for leadership, big data, machine learning, and open source. For most of my time at Seznam, I worked on the search engine, and I was lucky enough to become Head of...



Saturday January 25, 2020 1:00pm - 1:55pm CET
A112 Faculty of Information Technology Brno University of Technology, Božetěchova, Brno-Královo Pole, Czechia

3:00pm CET

Data analytics with distributed tracing data
Modern observability systems can be seen as a platform providing an executive view of the system and an interface where users can ask questions about system behaviour. However, these questions are sometimes complex and require data aggregation, feature extraction or running a machine learning algorithm.

Come to this talk to learn about a data analytics pipeline based on JupyterLab, Apache Spark and Kafka, integrated with the distributed tracing system Jaeger. We will explore how the platform is integrated into Jaeger and what benefits it provides to DevOps engineers and data scientists. There will be a brief introduction to distributed tracing and the Jaeger system, and then we will run a live demo with the analytics stack deployed on OpenShift and try to answer questions about a monitored application.

Speakers

Pavol Loffay

Principal Software Engineer, Red Hat
Pavol Loffay is a principal software engineer at Red Hat working on open-source observability technology for modern cloud-native applications. Pavol contributes to and maintains the Cloud Native Computing Foundation (CNCF) projects OpenTelemetry and Jaeger. In his free time, Pavol likes...



Saturday January 25, 2020 3:00pm - 3:25pm CET
E105 Faculty of Information Technology Brno University of Technology, Božetěchova, Brno-Královo Pole, Czechia

3:30pm CET

Apache Spark on planet scale
Apache Spark is an open-source, distributed, general-purpose cluster-computing framework with implicit data parallelism. OpenStreetMap (OSM) is a huge database of features found on the Earth's surface. Working with that database is hard, so Spark is a natural solution to the processing issues caused by the size of OSM. I'm going to show how to load OSM data into Spark, how to run processing algorithms such as extract/merge or render, and how using Spark improves the development process and greatly cuts processing times.

Speakers

Denis Chaplygin

Software engineer, Wolt enterprise Oy
Software engineer at Wolt working in the logistics area. Ex-Red Hatter.



Saturday January 25, 2020 3:30pm - 4:25pm CET
E105 Faculty of Information Technology Brno University of Technology, Božetěchova, Brno-Královo Pole, Czechia

4:30pm CET

Notebook showdown: JupyterHub vs. Apache Zeppelin
Nearly everyone working in the data science industry knows about or has heard of JupyterHub. It is an essential tool for data exploration and visualization, and nearly anything is possible as it provides an immediate code-based interface to popular data science frameworks. But did you know there is another contender that can do all of this as well?

Enter Apache Zeppelin! In this talk you will learn about the Apache Zeppelin project and how it is different from JupyterHub. Ricardo will demonstrate the strengths of both platforms and show you how to work effectively and creatively on either. You will leave this talk with a better understanding of the notebook landscape and be more informed about which platform will best serve your needs.

Speakers

Ricardo Oliveira

Kubeflow contributor, Red Hat
Ricardo has worked as a Senior Software Engineer at Red Hat for 8 years, totaling 13 years working for the same company. Since 2009, he has been working on Open Source projects focused on data and AI/ML, among them Open Data Hub and Kubeflow. Graduated in Computer Science, he seeks...



Saturday January 25, 2020 4:30pm - 5:25pm CET
E105 Faculty of Information Technology Brno University of Technology, Božetěchova, Brno-Královo Pole, Czechia

5:00pm CET

Distributed data workflows: PySpark vs Dask
Until very recently, Apache Spark was the de facto choice of framework for batch data processing at scale. For Python (or new) developers, diving into Spark is challenging, as it requires learning the Java infrastructure, memory and configuration management. The multiple layers of indirection also make it harder to debug errors, especially when dealing with the PySpark API.

With Dask, a pure Python framework for parallel computing, Python developers now have an intuitive and elaborate way of building scalable data pipelines. In this talk, we'll use a data aggregation use case to highlight the important differences between the two frameworks and to make clear the overall benefits of moving from one framework to the other.

By the end of the talk, developers, data engineers, and data scientists will have a framework and benchmarks to refer to when making informed decisions about building their production data engineering pipelines.

Speakers

Vaibhav Srivastav

Data Scientist, Deloitte GmbH
I am a Data Scientist and a Master's Candidate in Computational Linguistics at Universität Stuttgart. I am currently researching speech, language and vision methods for extracting value out of unstructured data. In my previous stint with Deloitte Consulting LLP, I worked with Fortune...



Saturday January 25, 2020 5:00pm - 5:25pm CET
D0207 Faculty of Information Technology Brno University of Technology, Božetěchova, Brno-Královo Pole, Czechia
 
Sunday, January 26
 

3:30pm CET

Does using data mean giving up privacy?
Deep learning and machine learning more broadly depend on large quantities of data to develop accurate predictive models. In areas such as medical research, sharing data among institutions can lead to even greater value. However, data often includes personally identifiable information that we may not want to (or even be legally allowed to) share with others. Traditional anonymization techniques only help to some degree.

In this talk, Red Hat's Gordon Haff will share with you the active research taking place in academia and elsewhere into techniques such as multi-party computation and homomorphic encryption. The goal of this research is to enable broad information sharing leading to better models while preserving the anonymity of individual data points.

Speakers

Gordon Haff

Principal, BitMasons
Gordon Haff is Principal Analyst at BitMasons where he writes and consults with an emphasis on open source and computing infrastructure. At Red Hat, he worked on market insights and portfolio architectures and wrote about tech, trends, and their business impact. His books include...



Sunday January 26, 2020 3:30pm - 3:55pm CET
D0207 Faculty of Information Technology Brno University of Technology, Božetěchova, Brno-Královo Pole, Czechia
 