Transcript
00:00Hey friends, welcome back! In this video we dive deeper into notebooks in Fabric, so we cover more of the data engineering side of Microsoft Fabric.
00:14To get started with that, we have various options. The most intuitive one probably is to go to new item, and under new item we will find notebooks.
00:23We can just use the filter. If I go to notebook up here, you find it can be used to get data into Fabric.
00:30We can use it to prepare data, so for cleaning data, for reshaping data, right? We can do this, and we can also of course use it to analyze data or to train models.
00:39Training models then refers to the data science aspects of Fabric.
00:45But you see that notebooks are used quite frequently because they have very many use cases.
00:50So, we could choose either one of those notebooks, this would be one option.
00:54But before we do this, let me also show you another option.
00:57The other option, for instance, to create a notebook would be to go inside one of the lakehouses. As an example, click on a lakehouse.
01:05You'll find that there's an option up here, open notebook, and this allows us either to open an existing notebook or to create a new one.
01:14The main difference is that when we create the notebook from here, the lakehouse we're currently in is attached directly to the notebook, meaning the notebook can easily reference tables or files from this specific lakehouse, use them, and then transform the data inside them.
01:32There's a difference.
01:34If we, on the other hand, go back to our workspace, use the new item option up here, and choose to create a notebook from this section, then we would have to manually attach a lakehouse.
01:46If we want to do this, of course.
01:47That's not required, but we could do it.
01:49So, let me show you this way.
01:51So, let's go in here and choose notebook.
01:54And then it doesn't matter which of those options you choose, just choose one of them.
01:58So, I choose the first one.
01:59I use notebook.
02:00And then we just wait and now the notebook is created.
02:03By default, it is always named Notebook 1, but you can click up here and choose a different name if you want to do that.
02:09So, for now, let me just take a look at this.
02:13So, what we have is we have a code cell in here.
02:15By default, when we create a notebook like that, the code itself is PySpark.
02:20But you can also change it for each cell, by the way.
02:24If you click on the drop-down menu here, you find the currently supported programming languages as well as HTML.
02:31I wouldn't consider HTML a programming language itself, but it's still available in here.
02:35So, you can write Python code via PySpark, you can use Scala, Spark SQL, SparkR, and also HTML.
02:43These are the currently supported options for writing your code in here.
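By the way, besides the drop-down, you can also switch the language of a single cell with a magic command in its first line; a minimal sketch, assuming the usual Fabric cell magics such as %%sql:

    %%sql
    -- run just this one cell as Spark SQL while the notebook default stays PySpark
    SELECT 1 AS sanity_check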
02:48The thing which I mentioned at the beginning is currently no data source is attached.
02:53So, no lakehouse and no warehouse.
02:55And this is because we created the notebook from the new item section.
02:59This means that currently we have the option under Explorer to add a data source.
03:05Under resources, we can upload files here for the notebook, which can then be used.
03:11For instance, Python code snippets, modules, anything like that.
03:15But you can also attach one of those options here, either lakehouses or warehouses.
03:20So, currently nothing is attached, but I could go here, click on lakehouses, and then click on the plus symbol here.
03:27And then I can choose either a new lakehouse, which can be created, or I can use an existing lakehouse with schema or without schema.
03:34In this case, I created my lakehouse with schema.
03:37So, that's why I choose existing lakehouse with schema.
03:40Remember the dbo schema, which I used.
03:43So, click on that.
03:44And then I can see the different lakehouses I have access to.
03:47Now, of course, if you have access to several workspaces, you can also link a lakehouse from a different workspace, if you want to do that.
03:54But for me now, I would like to use the one which we have created together in Fabric Trial.
03:59So, I choose this lakehouse, select it, and click on add.
04:03So, just a second, and now we can see here the tables, as well as if we click on files, we can see the files in here.
04:10So, this just means that it now becomes easier to get the data directly into the notebook cell and run it.
04:18As an example, if I want to now import one of those tables, because I want to analyze it and maybe also then clean it further and preprocess it,
04:26I can simply either write the code myself or, even easier, I can go in here to the Delta tables section, for instance.
04:33I can go to my pokemon_update table, as an example, click on the three dots, and then there's an option to load data.
04:39Hopefully, you can see that, there it is.
04:41And this means I can load directly the data using Spark.
04:45So, if I click this option, like that, then you see that a new code cell is inserted, which is simply then spark.sql and then the select statement.
04:53So, SELECT * FROM Fabric_Lakehouse.dbo.pokemon_update LIMIT 1000.
04:57And the reason why this can be used here as a data source is because the Lakehouse is attached.
05:02So, it's quite easy to reference, actually, specific Delta tables or also files from here and use them inside this code cell and then run it.
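For reference, the generated cell looks roughly like this (a sketch; the lakehouse, schema, and table names match my setup and will differ in yours):

    # generated by the load data option: query the attached lakehouse table via Spark SQL
    df = spark.sql("SELECT * FROM Fabric_Lakehouse.dbo.pokemon_update LIMIT 1000")
    display(df)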
05:11And my data is loaded. It took me two and a half minutes. Normally, it should work faster.
05:15But for now, this was the time. And now we can see here, this is now my data frame.
05:20And the data frame now shows me a table here. I can also create a new chart if I want to do that.
05:25So, you can create charts on the data frame. This one was created automatically, but you can also choose to edit it.
05:32You also have here various options, which you can choose. For instance, if I choose total, this is total.
05:37If I choose attack, I get attack. Or I can also completely build this on my own.
05:41So, if I go build on my own, I have the option here to choose what kind of chart I want.
05:46So, let's say, for instance, a column chart, I can add a title, a subtitle, and then I can add data on the axis.
05:52So, for instance, type 1 on the x-axis and then for the y-axis, I could say I'd like to have the attack.
05:58Click on attack and then you can see that now I have my chart created, right?
06:02So, that would also be an option. And of course, there are further options you can use.
06:06For instance, aggregating the data stacked and so on. And there's also, next to basic, there's an advanced section,
06:11which allows you to further customize this. So, you can create a chart like that if you want to do this.
06:15It's not required; it's just an option which you can also include in your, in this case, PySpark notebook.
06:21All right. It's a nice extension, which is also available right now.
06:24So, if we go back to the table option, and there is the table itself.
06:27So, now, of course, we could do what is more interesting, especially for the data engineers:
06:32we could do the data transformations in PySpark, right?
06:35Or, as I said, in other kinds of languages, if you want to choose this.
06:39But for now, let's just write a few simple lines of code.
06:41Let's say, in this case, df is equal to, and then let's just add a new column.
06:45Let's say, df.withColumn. And there you can also see you have IntelliSense here.
06:50So, you don't need to know all the PySpark code yourself.
06:53You also can leverage the options which are built-in in here.
06:56So, I say, df.withColumn. I can choose this option here.
07:00And then I can say, okay, what is the new column name?
07:02Let's say, in this case, the name is new.
07:04And then what I want to do, let's say, I'd like to aggregate two columns,
07:08like the sum, for instance, of Total and HP, right?
07:11I could add those two numbers just to have one additional column, as an example.
07:15So, then I can leverage the sum function here.
07:17And the sum function then needs a list.
07:19So, you open brackets here and say, then df.
07:22And then for df, I'm using the Total column.
07:25The great thing is that this is also here proposed from IntelliSense.
07:29So, I just need to press my tab here just to select it.
07:32So, I don't need to write it myself.
07:34And then as a second argument, let's go on here.
07:36Let's say, for instance, okay, I also like to use from df.
07:39I'd like to use the HP. So, this one here.
07:43And hopefully, the parentheses are correct.
07:46So, let's just check that.
07:48Let's go on here and say, we'd like to show that.
07:50So, df.show, as an example.
07:53And we can run this, the show command.
07:55And then just wait a second.
07:56And we should actually get an output here.
07:58So, it looks good.
08:00So, obviously, the formula was correct.
08:02So, no parenthesis or bracket errors.
08:06And there is the data itself.
08:08So, this is basically the show command.
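Put together, the dictated cell looks like this (a sketch; the column names Total and HP come from my Pokemon table and may differ in yours):

    # add a new column "new" holding the sum of the Total and HP columns;
    # Python's built-in sum works on a list of Spark Columns because they support +
    df = df.withColumn("new", sum([df["Total"], df["HP"]]))
    df.show()  # plain-text preview; display(df) renders the rich table instead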
08:10By the way, if you also want to have this nice-looking table,
08:12you just replace the show with display.
08:14So, let me enter a new cell for now.
08:20And let's say, in this case, display is the command.
08:24So, display df.
08:25If you use this instead of the show command,
08:27you get a properly formatted table like that.
08:29Right?
08:30As we've seen before.
08:31So, that would be the other option.
08:33So, maybe let's do one additional transformation.
08:36And then we write the data back.
08:38So, let me enter a new cell.
08:40By the way, if you click on the last cell
08:43and press B on your keyboard, you insert a new code cell,
08:47if you want to do this.
08:48So, in that, let's say, df is equal to df.select.
08:52Let's just clean our data frame a little bit.
08:55So, just use a few of those columns,
08:57because those are probably too many right now.
08:59And then, again, we use a list.
09:00And then I'm choosing the Name column.
09:02I'd like to have the Type 1.
09:04Type 1.
09:05And maybe also the Total.
09:08And then I'd also like to have the new column.
09:12Right?
09:13We create a new column here.
09:14So, let's just check if this is actually in place.
09:16Just to mention it here.
09:17There's our new column, by the way.
09:19465.
09:20So, 465 is simply 405 plus 60.
09:23Right?
09:24So, the sum of those two is exactly what we created in here.
09:27So, now let's just choose those three columns or those four columns from the original one.
09:32And one more time, let's display that.
09:33Just to see that we actually really adjusted the data frame.
09:37So, we selected the relevant columns for us.
09:39So, let's say display(df).
09:41And then just wait until this cell is executed.
09:44There we are.
09:45And you can see that now we have name, type 1, total, and new.
09:48And the other columns are now gone.
09:49Right?
09:50Of course, you could also not only save this in the same data frame.
09:53You can also create a new data frame.
09:54So, for instance, df2 equals df and so on.
09:56Right?
09:57You don't have to override your specific data frame.
09:59That's just an example in here.
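As a sketch, the select step from above (again with my column names):

    # keep only the columns we care about; select accepts a list of column names
    df = df.select(["Name", "Type 1", "Total", "new"])
    display(df)

    # or, if you don't want to override the original frame:
    # df2 = df.select(["Name", "Type 1", "Total", "new"])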
10:01And finally, of course, the most interesting part is after we are doing all the heavy cleaning
10:06and transformation.
10:07Right?
10:08All these steps.
10:09Then, of course, we would like to write the data back into our lakehouse.
10:12And, in the best case, as a delta table.
10:16Because a delta table, at the end of the day, could also be used in a semantic model,
10:20which then could feed Power BI reports.
10:22Right?
10:23So, this is the main idea.
10:24So, how do we now get any kind of transformed data frame, which we have created and cleaned
10:28inside our notebook, back into Fabric and into the table section as a delta table?
10:33This can also be done using PySpark.
10:35So, what we need to do for that is we need to write the data back.
10:39And this can be done.
10:40Let me enter a new cell.
10:41Press B on your keyboard.
10:42And then go on here and say df.
10:44In this case, .write, because we want to write the data back.
10:48Then .mode.
10:49The mode can specify how we want to write the data.
10:52Also, you get here additional information.
10:54But more or less, we actually have normally the two options, overwrite and append.
10:58These are the most often used.
11:00And overwrite means that if the table already exists with the specific name which we gave it,
11:04then, of course, we would overwrite it.
11:06Otherwise, if you would like to append data, we can also do this.
11:09So, if the table which we want to write the data to, the delta table, already exists up here,
11:14we can also append data to it.
11:16So, in this case, the data does not exist already.
11:19So, it does not matter which option we choose, but I would like to mention this.
11:22So, use append or overwrite.
11:25As I said, it's up to you.
11:26Then .format, and the format we want to save the data is delta.
11:31Because in Fabric, to be specific,
11:36the underlying format for the table section is delta.
11:38So, everything is stored in delta tables.
11:40So, for format, we just specify here delta.
11:43And then, finally, of course, we save it.
11:47And we have two options.
11:48We can either use the save command, or there's also saveAsTable.
11:51So, the difference is, if we use the save command, you would also have to write a path like Tables/ plus the table name.
11:56And if you use saveAsTable, you don't have to do this.
11:59You just need to give it a name.
12:00For instance, let's call it spark_pokemon_update.
12:04So, put it in quotes.
12:05spark_pokemon_update, like this.
12:11And, of course, you can give it any name, but that's basically it.
12:14And then we can execute the cell, either by clicking this run command, or pressing Shift-Enter.
12:19You can also do this, and then the cell gets executed.
12:22And then we write this specific data frame back into the lakehouse as a delta table.
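The complete write cell then looks roughly like this (a sketch; spark_pokemon_update is just the name I chose):

    # write the data frame back to the lakehouse as a delta table;
    # mode("overwrite") replaces an existing table, mode("append") adds rows to it
    df.write.mode("overwrite").format("delta").saveAsTable("spark_pokemon_update")

    # equivalent with save(), where you spell out the Tables/ path yourself:
    # df.write.mode("overwrite").format("delta").save("Tables/spark_pokemon_update")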
12:28So, you've seen that it succeeded.
12:30So, let's just check.
12:31And currently, I can't see it.
12:32So, let's click on the three dots here, and then click on refresh, and then we should see it.
12:37And there it is.
12:38Right?
12:39So, this table could now be used either in the lakehouse itself, or, of course, in Power BI.
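As a quick sanity check, you could also read the new table back in the notebook (a sketch, assuming the table name from above):

    # read the freshly written delta table back and peek at it
    df_check = spark.read.table("spark_pokemon_update")
    display(df_check)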
12:44So, that's basically it for an introduction to a PySpark notebook.
12:49This is where a lot of data engineers will spend their time.
12:52But, hopefully, you have seen that when you attach the lakehouse, it's quite easy to get the data directly inside the notebook itself.
13:00For a table, you just click on the three dots,
13:04load it directly inside the notebook, and then you can start with your PySpark transformations.
13:08Or using any other kind of supported language, of course.
13:11So, that's it for this video.
13:12If you've got questions, let me know.
13:14Otherwise, thanks for watching, and I hopefully see you in the next video.
13:17Until then, bye guys.