00:00 Hey friends! Welcome to this video where I want to dive a little bit deeper into notebooks to
00:11 write PySpark or Pandas code. Now I know that not everyone is a data engineer and maybe you don't
00:16 have the skills to write the code yourself. This is where Fabric comes to the rescue, because it also
00:21 allows us to do any kind of data transformation in a drag-and-drop way. And how that works,
00:26 we can explore right now. For that, you can see that I went back to my Fabric trial workspace.
00:31 So you just go to workspaces and then choose the specific workspace you want to use. In this case,
00:35 I'm going to use this one. And now we create a new notebook. We already covered this. So what
00:41 we can do, we can either go to New item up there, choose the notebook, and then attach, for instance,
00:46 a lakehouse or a warehouse ourselves if we want to work against one.
00:50 Or if we want to skip this step, we can also go directly into the lakehouse we want to attach,
00:56 for instance this one. And then we can go to Open notebook and just choose, in this case, a new
01:01 notebook. So click on New notebook for now. And then we just open this notebook. In this case,
01:07 we call it "data wrangler", because that's the right name for what we're about to explore.
01:14 And then let's explore what the data wrangler actually can do for us. So now we have the new
01:19 notebook. And we can also see that, because we created it from the lakehouse, the lakehouse is directly
01:24 attached. The same is true, by the way, for the warehouse, as I already said before. But now we want to use
01:29 lakehouse data for that. So click on the lakehouse, and we can also see that there is the Tables
01:33 section and there are also the Files available. And now, depending on what you want to do, we can use
01:37 either PySpark or we can use pandas, right? And to load the data, we also know that we can write
01:42 the code ourselves, or we could also go to the three dots and have the load code generated for us.
01:47 So for instance, if I click on this orders table here, I can take this option: I can say Load data and then choose
01:53 Spark in here. And then I get this spark.sql with a select star from fabric.lakehouse.dbo.orders
01:59 limit 1000. And one more time, this works because we attached the lakehouse to this notebook.
02:04 So you can read it directly via, in this case, lakehouse name.schema.tablename.
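For reference, that generated "Load data > Spark" cell looks roughly like this; the lakehouse, schema, and table names below just mirror this example, so adjust them to whatever you attached:

```python
# Roughly what the generated load cell contains - the table path
# (lakehouse.schema.table) and the limit reflect this example setup.
df = spark.sql("SELECT * FROM Fabric_Lakehouse.dbo.orders LIMIT 1000")
display(df)
```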
02:10 Okay, so this is our example. We can actually run this: click on Run cell and then we just
02:15 wait until the notebook session starts. You can see here it's connecting, and then we should see the result.
02:20 Okay, and it's finished. We can see it took around two and a half minutes for me. And now I can
02:27 see my data frame in here. So this is basically the data frame loaded. So far so good. Now next, let's
02:33 actually try out the data wrangler. You can already clearly see the data wrangler up there.
02:38 This is one option which you can choose. It then starts searching for data frames inside our notebook.
02:43 In this case, it has already detected one data frame, which is this df data frame,
02:48 this one here. The other option: you can see that the data wrangler is also offered right on this
02:53 specific display output. So if you go to data wrangler there, it will also guide you directly into the
02:57 interface of the data wrangler. And this is the interface for everyone who wants to use PySpark
03:03 or Python code for transformations, but is not able to write
03:07 the code or does not want to write the code. Either way, for that we have the data wrangler in
03:13 Microsoft Fabric. Now you can see that you've got a preview of the data. You can scroll through it,
03:16 but this is exactly what we've seen before. And you also get here some, well,
03:21 statistics regarding the columns you have inside. But much more interesting, of course, are the
03:27 transformations. For instance, what we could do is use some of them right now. For instance,
03:33 here we have, let's see, different kinds of columns. That's fine. And here, for instance,
03:37 we have a region column with the values America and Europe, these are the two regions. Let's say instead of
03:42 America, we would like to have Americas, so the plural of that. So we could go to the region column,
03:50 first select the column, then go to Find and replace. And then we say we would like to find
03:55 and replace: inside the region column, the old value is, for instance, America, and the new value is
04:02 Americas. Americas, like that. Then we can also specify whether you want to match the full string,
04:09 or match case, or use regular expressions. I want to match the full string, so I select this option
04:13 here. And then we basically could go. So for region, America, you can see that there is the old value,
04:19 which is highlighted in red, and there is the new one, which is highlighted in green. And we can clearly
04:25 see the change: now there is an S at the end of America. And if you like this change, of course, you can apply it.
04:30 Otherwise, of course, you can discard it. But let's say I'd like to apply it. Before I click this, though,
04:34 please have a look at what is shown down there. For this replace, well, the data frame was
04:40 converted first into pandas, and it shows us here the code which is generated. In this case,
04:45 it's a pandas replace on the Region column, turning America into Americas. So that's what happened here.
04:50 Please also note this little hint here: your Spark data frame has been converted into a pandas
04:56 sample for performance, but all the code generated by Data Wrangler will be converted to PySpark when
05:02 you add it to the notebook. That's important. So even if you see pandas code here, it will be converted
05:07 into PySpark. Your data frame itself will not be converted, so it will stay, in this case,
05:11 a PySpark data frame, and the transformations will then be applied to this specific PySpark data frame.
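For reference, the two flavors of that replace look roughly like this. This is only a sketch assuming a column called Region, not the literal code Data Wrangler generates:

```python
# Pandas-style preview (runs on the sampled data inside Data Wrangler);
# df_sample is a hypothetical name for the sampled frame.
df_sample['Region'] = df_sample['Region'].replace('America', 'Americas')

# PySpark equivalent of the same step (what ends up in the notebook):
from pyspark.sql import functions as F
df = df.withColumn(
    'Region',
    F.when(F.col('Region') == 'America', 'Americas').otherwise(F.col('Region'))
)
```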
05:16 So that's fine. Click on Apply for now. Okay, here we go, and we've applied our first transformation,
05:22 which you can find down there as well: the cleaning steps. This is kind of similar to what
05:26 we already know from Dataflow Gen2. There we have the applied steps window, and here we
05:32 have the cleaning steps window, which also allows us to click on the prior steps, see the
05:36 results, and delete a step. So if we say we don't need this step anymore, just delete it,
05:41 and then it's gone, right? So let's click on the last operation again, the one with the Americas.
05:47 And now let's also maybe change the schema, let's say. Whenever
05:53 you want to drop columns or add additional things, you can go to the schema option here. So go to Schema,
05:58 like that. And then you can see you can change the column type, clone a column, drop a column,
06:02 rename a column, and so on. So let's say I'd like to drop a column. I don't want to drop the
06:07 region, which was currently selected. Let's go in here and maybe drop other columns. Let's say I'd like to
06:13 drop the customer zip code, I don't need this. And let's say I'd also like to drop the order ID,
06:19 the customer ID, the product ID, and the employee ID, just as an example. And let's click on apply.
06:24 But first, please, let's have a look at this. You also see the highlight in red, and down there,
06:29 again, you can see the code: it's simply a drop with, in this case, the column names.
06:33 If we apply this step, we should see that the ID columns are now gone, and we have fewer
06:39 columns in our data frame. So this step has also been applied, and you can see that five columns
06:44 have been dropped. And one more time, you can still see the delete option and so on here as well.
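For reference, once that drop step lands in the notebook as PySpark, it boils down to something like this; the column names below just follow this example dataset, so treat it as a sketch:

```python
# Sketch of the drop-columns step in PySpark; the generated code wraps
# steps like this in a function, as shown later in the video.
df = df.drop('Order ID', 'Customer ID', 'Product ID', 'Employee ID', 'Customer Zip Code')
```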
06:49 Now, of course, you find many additional transformations
06:55 here, so let's have a quick look at what's available. Under find and replace, we have drop duplicate rows and
07:00 drop missing rows, which are often used in data cleaning steps, fill missing rows,
07:05 and find and replace, which we've seen. Under format, we find various options to capitalize letters, to convert
07:10 text to lowercase or uppercase, and so on. We have the option to split text and strip whitespace.
07:15 Under formulas, we have the option to binarize labels and to one-hot encode labels, which is an important
07:20 transformation for machine learning, and we can calculate text length.
07:25 Under numeric, we have some options like rounding, scaling, and so on.
07:29 Schema we've seen. Sort and filter is also available, and group by and aggregate is also
07:35 available. So there are a lot of transformations in here, which might be quite tedious if you had to know all the
07:40 PySpark code, but here we can simply click them together on our data frame.
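For reference, this is the sort of PySpark you would otherwise write by hand for a few of those operations; a rough sketch with hypothetical column names ("City", "Region", "Sales"), not code generated by the wrangler:

```python
from pyspark.sql import functions as F

# Drop duplicate rows and rows with missing values
df = df.dropDuplicates().dropna()

# Lowercase a text column and strip whitespace (hypothetical column "City")
df = df.withColumn('City', F.trim(F.lower(F.col('City'))))

# Group by and aggregate (hypothetical columns "Region" and "Sales")
df_agg = df.groupBy('Region').agg(F.sum('Sales').alias('total_sales'))
```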
07:45 And as soon as we are done, all we need to do is go to Add code to notebook up here. By the way, you can also save the
07:50 transformed data, in this case as a CSV file, and download it if you want. You can also copy the code
07:56 to the clipboard if you want to store the code somewhere. Or, if you want to apply it, simply go to
08:01 Add code to notebook. Click this option, and yes, I want to do this, so click on Add in here. And then
08:09 if you scroll down to the bottom, you should find the generated code here. But this time,
08:15 as the message already said, this is PySpark code. So even though we saw pandas code in the preview,
08:22 this now is PySpark code. It creates a function, and this function is then applied to the data frame itself.
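The generated cell follows a pattern roughly like this; a sketch assuming the replace and drop steps from above, so the real function body and column names will differ in detail:

```python
from pyspark.sql import functions as F

def clean_data(df):
    # Replace 'America' with 'Americas' in the Region column
    df = df.withColumn(
        'Region',
        F.when(F.col('Region') == 'America', 'Americas').otherwise(F.col('Region'))
    )
    # Drop the ID and zip code columns
    df = df.drop('Order ID', 'Customer ID', 'Product ID', 'Employee ID', 'Customer Zip Code')
    return df

df_clean = clean_data(df)
display(df_clean)
```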
08:27 And then, of course, if I run this, so press Shift+Enter or press the Run
08:33 command here, and just wait until we get the output, we should see that after applying these steps
08:38 we have dropped the columns. And we should also see Americas in our data set instead of America.
08:43 So you see here, now it starts with order date; if I scroll up here, the order
08:48 date actually was this column, but we removed those columns right in the data wrangler. And if I scroll down
08:53 further in here, let's go further down there, you can see here Americas, so the region replacement has also been
08:59 applied as a step. So this is how the data wrangler works. It's a great feature,
09:04 because it extends the functionality of the notebooks. It also allows users who might not be familiar
09:09 with PySpark or with Python code, or who don't want to write it, to quickly transform the data. And you
09:15 can also go back into the data wrangler here, if you click this option, and then you can actually do new
09:19 transformations there as well, if you want to do this. So let's go back.
09:27 That was the PySpark case. But I just want to highlight that it's also available for pandas. That
09:33 just means if I go to my lakehouse here and load another file, this time from the Files section,
09:39 if I click on Files and then, for instance, I'd like to use the Pokemon file here, so Pokemon update
09:44 to be specific, I can click the three dots and load the data, but this time not to Spark. Instead I
09:50 load it to pandas. So I click on Pandas, and then maybe let's at least rename this. Let's say
09:56 dfp, because it's pandas in this case, and then display(dfp), like that.
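For a file in the Files section, the generated pandas load cell looks roughly like this; the path and file name below are illustrative and depend on the attached lakehouse:

```python
import pandas as pd

# Read the CSV from the attached lakehouse's Files section (path is illustrative)
dfp = pd.read_csv("/lakehouse/default/Files/pokemon_update.csv")
display(dfp)
```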
10:05 Okay, here we go, let's run this, and then we just wait. And now this time we see another, well, table. But this time,
10:10 of course, it's pandas. This time it is a Python pandas data frame, not a PySpark data frame.
10:16 But the same is true here as well: the data wrangler works no matter whether you have a Python
10:21 data frame, so a pandas data frame, or a PySpark data frame. So if I go to data wrangler in here,
10:26 I can do exactly the same transformations. The data is loaded, as you can see here,
10:30 I've got my Pokemon table in here, and I can do my transformations. So, for instance,
10:35 I see there is Normal, Poison, and so on; find and replace would also work here. But let's actually
10:40 maybe capitalize letters or make everything lowercase instead. So let's say I use Type 1 in here,
10:46 and I want to actually convert this to lowercase. I can then say I would like to convert
10:50 text to lowercase. I take this option. Type 1 is already selected. You can of course choose
10:55 several columns, Type 2 and so on. Of course, here, Fabric is smart enough to offer only the string
11:01 columns, because we cannot convert a numerical type, right? So that's why it only offers these
11:05 options in here. But that's fine. Let's click on apply. We also see the preview one more time.
11:10 We click the apply option, and then this step is applied. And the same works for the other
11:16 transformations. For now, I think that's fine; as an example, I think you get the point.
11:20 And if you're done with that, you can add the code to the notebook. Click on Add code to notebook,
11:25 and if I scroll now to the bottom, it works exactly the same way. We have a function which is created.
11:30 This time, of course, it is, in quotes, normal Python code, not
11:35 PySpark code, and the clean data function here is applied to dfp as well.
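For the pandas data frame, the generated cell follows the same pattern, just in plain pandas. Roughly, assuming only the lowercase step on the Type 1 column (the real generated code may differ slightly):

```python
def clean_data(df):
    # Convert the 'Type 1' column to lowercase
    df['Type 1'] = df['Type 1'].str.lower()
    return df

dfp_clean = clean_data(dfp.copy())
display(dfp_clean)
```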
11:41 And now if we execute this and look at the results, we should see
11:46 the transformation here. And you can see that Type 1 is now lowercase, right?
11:51 So it works exactly the same way. No matter whether you have a PySpark data frame or a
11:56 pandas data frame, you can use the data wrangler in order to do the transformations.
12:00 It does the heavy lifting without you writing a single line of code yourself. And after doing this, of course,
12:06 you can write the result back either into the Files section, if you want to store it as files,
12:10 or also, of course, into the Tables section, if you want to save it as Delta tables. And we have already seen how to do this.
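Writing the transformed Spark data frame back as a Delta table in the attached lakehouse looks roughly like this; the table name is just an example:

```python
# Save the cleaned data frame as a Delta table in the attached lakehouse
df_clean.write.format("delta").mode("overwrite").saveAsTable("orders_cleaned")
```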
12:14 So that's it for the data wrangler. Hopefully you enjoyed it, and hopefully
12:18 you understood how it works and what we did. If you've got questions, let me know. Otherwise,
12:22 give it a try yourself. And hopefully, see you in the next video. Until then. Peace, guys.