Background: I work as part of the advertising team at Xperi. We receive data from many sources and process it to generate reports. Because the data comes from different systems, if source data was held up by technical or manual issues, the reports generated from it could not be used. This led to revenue loss, as there was no way to recover the data. To overcome this problem, we decided to build a monitoring framework that runs tests on the processed data every day.
Goals: We must continuously monitor our data pipelines to ensure good data quality.
Solution & Results: To overcome the problem of missing data, I wrote a Python script that runs tests on the data we receive. If any test fails, we examine the data and flag the issue to the concerned team so that remedial action can be taken immediately. Jenkins made it easy to set up the monitoring job and trigger it daily. Additionally, the generated reports are emailed to stakeholders so that everyone can review them. Jenkins also captures the build status of each run, so historical results can be referenced when needed. We also integrated Jenkins with our internal communicator, which makes it easy to check job status at any point in the dev cycle.
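To give a flavor of the approach, here is a minimal sketch of such a daily check script. The specific checks, field names, and sample data are hypothetical stand-ins; the point is that the script exits with a nonzero status when any test fails, so Jenkins marks the build as failed and that status is preserved in the build history.

```python
import sys

def check_row_count(data, minimum=1):
    """Hypothetical test: the feed should not be empty."""
    return len(data) >= minimum

def check_no_null_ids(data):
    """Hypothetical test: every record should carry an ID."""
    return all(row.get("id") is not None for row in data)

def run_checks(data):
    """Run all checks and return the names of the ones that failed."""
    checks = {
        "row_count": check_row_count,
        "no_null_ids": check_no_null_ids,
    }
    return [name for name, check in checks.items() if not check(data)]

if __name__ == "__main__":
    # In the real job the data would come from the processed feeds;
    # a tiny inline sample stands in for illustration.
    data = [{"id": 1}, {"id": 2}]
    failures = run_checks(data)
    if failures:
        print("FAILED checks:", ", ".join(failures))
        sys.exit(1)  # nonzero exit -> Jenkins marks the build as failed
    print("All checks passed.")
```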
We relied on a few key capabilities. We used the ShiningPanda plugin to integrate our Python code with Jenkins, the Slack plugin to post the job status to our Slack channel, and Jenkins' "Build periodically" trigger to schedule the job to run daily.
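For illustration, a daily schedule in the job's "Build periodically" field might look like the following (the hour shown is a hypothetical choice; Jenkins' H token spreads start times across jobs to avoid load spikes):

```
# Run once a day, at a hashed minute within the 06:00 hour
H 6 * * *
```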
Jenkins forms the core of my solution. Before Jenkins, I was running the scripts manually and had to log in and run them even on weekends. Now, with the Jenkins jobs pre-configured, I am free to live my life while Jenkins takes care of the work.
We are happy with the results.