Handling vast amounts of data at Volvo Trucks
A major automotive manufacturer, committed to the European Green Deal, has pledged to generate 50% of its revenue from non-fossil sources, a step that requires innovative revenue streams. Key to unlocking this is data: each factory produces thousands of terabytes of it.
Our co-founder, Axel, spearheaded the creation of a solution with Growing pAI, consolidating this data into an accessible and cost-efficient data lake. The system supports both batch and streaming processes, allowing the Group to tap into its operational data seamlessly.
With Growing pAI, the manufacturer is now harnessing data to drive innovation, develop new services, and push towards their sustainability goals. Our partnership stands as a testament to how data can transform businesses, propelling them towards a sustainable and profitable future.
Medallion architecture with a data lakehouse
The medallion architecture
Having data and a platform is not enough to achieve these goals, though: we need to organise the data if we want to avoid it becoming a swamp! That is where the data lakehouse steps in, as it brings together the best of a data warehouse and a data lake.
Data lakes are flexible: they can handle unstructured data, and storage and compute are decoupled. Data warehouses offer better structure and governance, along with ACID transactions (Atomicity, Consistency, Isolation, Durability). By combining the two, the idea is to get the best parts of both with fewer of the limitations. The data lakehouse is made possible by the Delta Lake storage framework. To organize everything, bronze, silver and gold layers are set up.
Growing pAI focused on getting all the data into the bronze layer, yet it is important to understand the layers built on top of it in order to optimize the bronze layer. Data arrived in the raw layer via Qlik Replicate, IBM CDC, Event Hubs and on-premise MQ systems.
Bronze layer
In the bronze layer, the data is first ingested into the system; at this point the data is completely unvalidated. Bronze data is loaded incrementally and grows over time. In this case, it is a combination of batch and streaming.
Although the data is kept in a mostly raw state, additional fields are added that may be helpful later (when searching for duplicates, for example): the source file name, the load date and the send date.
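As a minimal sketch, bronze ingestion with these metadata columns could look roughly as follows in PySpark; the paths and table names are illustrative, not the production setup:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative landing path; the real pipeline receives data via Qlik
# Replicate, IBM CDC, Event Hubs and on-premise MQ systems.
raw_df = spark.read.json("/mnt/landing/orders/")

bronze_df = (
    raw_df
    # Keep the payload raw, but record where and when it came from.
    .withColumn("source_file", F.input_file_name())
    .withColumn("load_date", F.current_timestamp())
)

# Append incrementally; the bronze table only grows over time.
bronze_df.write.format("delta").mode("append").saveAsTable("bronze.orders")
```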
Silver layer
Upon ingestion into the silver layer, data is filtered, cleaned and augmented. This could mean the data is deduplicated, missing data is handled, incorrect data is removed or corrupted data is fixed.
Data validation rules are applied; this could be things like ensuring there are no nulls in the data, that data is unique, and that data is the correct type and format, as well as performing logical checks, such as checking that a country field contains a country that exists and is spelt correctly. We also perform checks such as whether all orders have an order date, or whether an order date always falls before a truck's factory date.
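A hedged sketch of such silver-layer cleaning and validation in PySpark; the table and column names (order_id, order_date, factory_date) are hypothetical stand-ins for the real schemas:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
bronze_df = spark.read.table("bronze.orders")

silver_df = (
    bronze_df
    # Remove duplicates introduced by replays or overlapping loads.
    .dropDuplicates(["order_id"])
    # Completeness rules: every order needs an id and an order date.
    .filter(F.col("order_id").isNotNull() & F.col("order_date").isNotNull())
    # Logical rule: the order date must precede the truck's factory date.
    .filter(F.col("order_date") <= F.col("factory_date"))
)

silver_df.write.format("delta").mode("overwrite").saveAsTable("silver.orders")
```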
Although it is not always necessary to start joining data together in the silver layer, it is the case in this factory, as the key insights come from bringing data from various systems together.
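As an illustration (again with hypothetical table and column names), enriching orders with data from a second system is a single join:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical example: enrich orders with vehicle data from another system.
orders = spark.read.table("silver.orders")
vehicles = spark.read.table("silver.vehicles")
orders_enriched = orders.join(vehicles, on="vehicle_id", how="left")
```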
Gold layer
Going into the gold layer, the data is transformed for specific use cases and business-level aggregation is applied.
At this level business rules are applied. Suppose a company wants to know its preferred customers. The company has decided that a preferred customer must have spent €500,000 in the most recent calendar year. A table can be created that sums customer orders by year to answer this question.
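A sketch of that business rule in PySpark, assuming hypothetical customer_id, order_date and order_amount columns:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
orders = spark.read.table("silver.orders")

# Sum order amounts per customer per calendar year, then apply the
# business rule: a preferred customer spent at least €500,000 that year.
preferred = (
    orders
    .withColumn("order_year", F.year("order_date"))
    .groupBy("customer_id", "order_year")
    .agg(F.sum("order_amount").alias("total_spent"))
    .filter(F.col("total_spent") >= 500_000)
)

preferred.write.format("delta").mode("overwrite").saveAsTable(
    "gold.preferred_customers"
)
```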
Data in the gold layer should be stored in Delta format to make use of features like the ability to restore a previous version, for example in the case of a processing error.
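For instance, rolling back after a faulty run can be done with Delta's history and RESTORE commands; the table name and version number below are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Inspect the table history to find the version before the faulty run.
spark.sql("DESCRIBE HISTORY gold.preferred_customers").show()

# Roll the table back to that version (the version number is illustrative).
spark.sql("RESTORE TABLE gold.preferred_customers TO VERSION AS OF 12")
```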
The medallion structure enables teams to get insights out of huge amounts of data, yet it comes with a cost that needs to be handled carefully.
The challenge of cloud costs
As organizations increasingly rely on cloud services, managing costs becomes a top priority. Within the Group, we recognized that our usage of Azure Storage and Databricks was growing rapidly and that, without careful management, costs would spiral out of control. We knew that striking a balance between innovation and cost optimization was essential to long-term success. By
- Optimizing the Spark code
- Creating transparency in running jobs, costs and their frequency
- Choosing the right storage account replication types
- Creating dashboards to control the costs over time
we achieved a cost reduction of 50%. On a data lake petabytes in size, that makes a huge difference.
Setting up frameworks to speed up development
As many different sources need to be ingested, you don't want to repeat each configuration. We therefore built a Python framework where only a configuration file needs to be filled in. The framework can be configured for both batch and streaming cases; Databricks Autoloader was chosen to handle the streaming cases.
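A simplified sketch of how such a config-driven intake could work; the config keys, paths and YAML format are illustrative assumptions, not the framework's actual interface:

```python
import yaml  # PyYAML
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each new source only needs a small config file, e.g. sources/orders.yaml:
#   source_path: /mnt/landing/orders/
#   format: json
#   target_table: bronze.orders
#   checkpoint: /mnt/checkpoints/orders/
with open("sources/orders.yaml") as f:
    cfg = yaml.safe_load(f)

# Databricks Autoloader (cloudFiles) picks up new files incrementally.
stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", cfg["format"])
    # Autoloader needs a location to track the inferred schema.
    .option("cloudFiles.schemaLocation", cfg["checkpoint"])
    .load(cfg["source_path"])
)

(
    stream.writeStream
    .option("checkpointLocation", cfg["checkpoint"])
    .toTable(cfg["target_table"])
)
```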
Other benefits of standardizing this intake method include:
- The silver layer is more unified: we always optimize or partition on a standard datetime column.
- The logging and data quality can be configured centrally. This avoids duplication of code.
- Naming conventions are applied automatically
- Necessary library updates can be handled centrally.
- …
Easy and secure data access management
Having vast amounts of data in place also calls for decent access-control permissions. You want an easy way to grant or revoke access for teams and users while keeping robust access control in place. With the help of Microsoft, we set up an Azure AD group structure.
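Once the AD groups are synced to the workspace, access can be granted or revoked per group rather than per user. A minimal sketch, assuming a hypothetical analytics-readers group with read access to the gold schema:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Grant an Azure AD group read access to the gold layer; membership
# changes in the group then take effect without touching the platform.
spark.sql("GRANT SELECT ON SCHEMA gold TO `analytics-readers`")

# Revoking works the same way when a team no longer needs access.
spark.sql("REVOKE SELECT ON SCHEMA gold FROM `analytics-readers`")
```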
What can we do for you?
In conclusion, as we stand at the precipice of data’s limitless potential, remember that the path to a thriving data lake is paved with intricate decisions and constant vigilance. The choice to collaborate with professionals who possess the skill set and know-how to navigate these waters is a choice that can propel your data-driven endeavors towards unprecedented success. So, as you embark on your journey, consider the wisdom of aligning with experts like Growing pAI who can turn your data lake dream into a resilient and impactful reality.
Massage your data
Call us for fast support at this number.
+32 473 444 882